On the Generalization of Equivariant Graph Neural Networks
from a theoretical perspective and derive a high-probability bound for their generalization error using the PAC-learning framework (Valiant, 1984; Mohri et al., 2012). Our analysis reveals three new insights: (i) the generalization gap depends on a weighted sum of logarithms of the spectral norms of the weight matrices; (ii) bottom layers have higher weight (aligning with common knowledge that bottom layers generalize better); and (iii) ε-normalization is essential to obtain a bound polynomial in depth instead of exponential. We compare this bound with existing results on Multilayer Perceptrons (Bartlett et al., 2017; Neyshabur et al., 2018).

Furthermore, we validate our results empirically on a real-world problem of molecular property prediction (Ramakrishnan et al., 2014), a task particularly well-suited for EGNNs. Specifically, we first establish that our theoretical bound highly correlates with the empirical one across different model hyperparameters. Second, we support our claim that ε-normalization reduces the generalization gap when increasing the depth of the model.

Finally, inspired by our theoretical findings, we propose a new regularization method. We evaluate it experimentally and compare it with a commonly used regularization based on the spectral norm. We find that ours leads to lower test loss and generalization gap across different tasks and choices of model hyperparameters (e.g., number of layers).

In summary, our main contributions are: i) we derive the first generalization bounds for E(n)-Equivariant Graph Neural Networks; ii) we validate our theoretical analysis with experiments on real-world data; iii) we show theoretically and empirically that ε-normalization helps generalization; and iv) we propose a new regularization method tailored to EGNNs and assess its performance on twelve regression tasks.

2. Related Work

Expressivity and Generalization of GNNs. Understanding the expressivity and generalization capabilities of Graph Neural Networks (GNNs) is crucial for their application across diverse domains. Xu et al. (2019) demonstrated the potential of GNNs to capture complex graph structures, setting a benchmark for their expressivity. However, challenges in expressivity and generalization are highlighted by Oono & Suzuki (2020) and Loukas (2020), who show that GNNs can lose expressive power or face limitations based on their depth and width. Theoretical advances by Barceló et al. (2020) and Garg et al. (2020) further dissect the logical expressiveness and representational boundaries of GNNs, respectively. Recent studies, such as (Tang & Liu, 2023) and (Yang et al., 2023), offer new insights into GNNs' generalization, suggesting inherent advantages of GNN architectures over MLPs in graph learning tasks.

Expressivity of geometric GNNs. Joshi et al. (2023) develop a geometric Weisfeiler-Leman (GWL) test to evaluate the expressivity of geometric GNNs, revealing how equivariant and invariant layers affect their ability to distinguish complex geometric graphs. Meanwhile, Wang et al. (2024) introduce ViSNet, an efficient model that enhances molecular structure modeling through geometric information, showcasing improved performance on molecular dynamics benchmarks.

Equivariant neural networks. Equivariant neural networks have reshaped model design by embedding data symmetries directly into network architectures. Introduced by Cohen & Welling (2016), these networks ensure equivariance to group actions, a concept expanded to Euclidean spaces (Weiler & Cesa, 2019; Satorras et al., 2021) and Lie groups (Finzi et al., 2020). For 3D data, Tensor Field Networks (Thomas et al., 2018) and SE(3)-Transformers (Fuchs et al., 2020) exemplify their utility. The approach is theoretically deepened by Lang & Weiler (2021) and adapted to dynamics with imperfect symmetries (Wang et al., 2022) and particle physics (Bogatskiy et al., 2020).

Generalization of equivariant GNNs. Recent studies have advanced our understanding of the generalization capabilities of equivariant GNNs. Petrache & Trivedi (2023) explore approximation-generalization trade-offs under group equivariance, while Behboodi et al. (2022) establish a PAC-Bayesian generalization bound for these networks. Bulusu et al. (2021) focus on translation-equivariant neural networks, and Kondor & Trivedi (2018) consider equivariance to the action of compact groups. Additionally, Elesedy & Zaidi (2021) demonstrate a strict generalization benefit for equivariant models, and Liu et al. (2023) highlight how physical inductive biases can enhance generalization. Sannai et al. (2021) further provide improved generalization bounds through quotient feature spaces.

3. Background

This section overviews E(n)-equivariant graph neural networks and provides definitions and results from learning theory that we leverage in our analysis. For readability, we summarize the notation used in this work in Appendix A.

3.1. E(n)-Equivariant Graph Neural Network

We consider geometric graphs and denote them by G = (V, E, C, Z), where V is the set of vertices, E ⊆ V² is the set of edges, C = {c_v}_{v∈V} ⊂ R^{d_c} is an indexed set of vertex attributes (or colors), and Z = {z_v}_{v∈V} ⊂ R^{d_z} is an indexed set of vertex coordinates. The neighborhood of a vertex v is given by J(v) = {u : (u, v) ∈ E}.

Let G be a group acting on two sets X and X′ by the representations Φ and Ψ, respectively. We say a function
f : X → X′ is G-equivariant if it commutes with the group actions, i.e., for all g ∈ G and x ∈ X, we have that f(Φ(g)x) = Ψ(g)f(x). Here, we are interested in graph models that are equivariant to: i) the Euclidean group E(n), which comprises all translations, rotations, and reflections of the n-dimensional Euclidean space; and ii) the permutation group Σ_n. To represent the actions of these groups on graphs, it is convenient to assume (WLOG) that V = {1, 2, ..., n}, that the feature sets C, Z are organized as matrices C ∈ R^{n×d_c} and Z ∈ R^{n×d_z}, and that the graph connectivity is given by the adjacency matrix A ∈ {0, 1}^{n×n}; in this case, we denote graphs as G = (A, C, Z). Thus, any g ∈ Σ_n can be represented by a corresponding permutation matrix P_g that acts on G by (P_g A P_g^⊤, P_g C, P_g Z). On the other hand, each element g ∈ E(n) (or, more precisely, E(d_z) in our case) is a translation (represented by a vector t_g ∈ R^{d_z×1}) followed by a linear transformation via an orthogonal matrix Q_g ∈ R^{d_z×d_z}, and acts on G by (A, C, (Z + 1_n t_g^⊤) Q_g), where 1_n is the n-dimensional column vector of ones.

E(n)-Equivariant GNNs (EGNNs; Satorras et al., 2021) are arguably the most popular models for geometric graphs. The basic idea consists of modifying message-passing GNNs by incorporating distances between vertices, computed from the geometric features, into the messages, and by recursively updating the geometric features in an E(n)-equivariant fashion. This way, EGNNs maintain the permutation equivariance of GNN layers while also being equivariant to E(n).

Let h_v^{(0)} = c_v and z_v^{(0)} = z_v for all v ∈ V. Also, assume that each edge (u, v) ∈ E has an associated feature a_uv ∈ R^{d_e}. At each layer ℓ = 0, 1, ..., L_egnn − 1, EGNNs compute the incoming messages to each vertex v from its neighbors u ∈ J(v) as

    μ_{u→v}^{(ℓ)} = ϕ_μ^ℓ(h_u^{(ℓ)}, h_v^{(ℓ)}, ∥z_u^{(ℓ)} − z_v^{(ℓ)}∥, a_uv),    (1)

and then update the coordinates as

    z_v^{(ℓ+1)} = z_v^{(ℓ)} + (1 / |J(v)|) Σ_{u∈J(v)} [(z_v^{(ℓ)} − z_u^{(ℓ)}) / γ(z_u^{(ℓ)}, z_v^{(ℓ)}, ε)] ϕ_z^ℓ(μ_{u→v}^{(ℓ)}).    (2)

We then recursively update the embedding of vertex v using an MLP ϕ_h^ℓ as

    h_v^{(ℓ+1)} = ϕ_h^ℓ(h_v^{(ℓ)}, μ_v^{(ℓ)}),    (4)

where μ_v^{(ℓ)} denotes the aggregation of the incoming messages {μ_{u→v}^{(ℓ)} : u ∈ J(v)}.

Remark 3.1 (ε-normalization). In their original formulation (Satorras et al., 2021), EGNNs do not include normalization, i.e., γ(z_u^{(ℓ)}, z_v^{(ℓ)}, ε) ≡ 1. In our analysis, we consider ε-normalization, i.e., γ(z_u^{(ℓ)}, z_v^{(ℓ)}, ε) = ∥z_u^{(ℓ)} − z_v^{(ℓ)}∥₂ + ε. This variant is available in the original implementation¹, but not discussed in the manuscript. We will see later in Section 5 that ε-normalization plays a crucial role regarding the generalization of EGNNs.

In graph-level prediction tasks, we often obtain a representation for the entire graph by applying a mean readout function to the output of the last EGNN layer, i.e.,

    h_G = (1 / |V|) Σ_{v∈V} h_v^{(L_egnn)}.    (5)

Finally, we send the graph embedding h_G through a final MLP ϕ_out to obtain the graph-level prediction

    g(G) = ϕ_out(h_G).    (6)

Hereafter, we refer to the full mapping g(·) (i.e., EGNN + final MLP) as the scoring model.
¹ github.com/vgsatorras/egnn/blob/main/models/gcl.py, lines 203, 216

3.2. Generalization bounds via Rademacher Complexity

The first important notion is that of the generalization error (or gap), defined as the difference between the expected (or true) and empirical risks w.r.t. an arbitrary loss function.
Definition 3.1 (Generalization Error). Let f : X → Y be a function and L : Y × Y → R₊ a loss function. Let S = {(x_i, y_i)}_{i=1}^m be a finite collection of i.i.d. samples from a distribution D over X × Y. The generalization error of f is defined as the difference between the expected loss and the sample loss:

    R_{S,L}(f) = E_{(x,y)∼D}[L(f(x), y)] − (1/m) Σ_{i=1}^m L(f(x_i), y_i).

Deriving generalization bounds for a function class often involves obtaining some measure of its size (or capacity). In this regard, the empirical Rademacher complexity (ERC) is one of the most popular tools. In particular, the ERC measures how well a function class can fit random noise.

Definition 3.2 (Empirical Rademacher complexity). Let F ⊂ R^X be a class of bounded functions and S = {x_i}_{i=1}^m ⊆ X a fixed set of size m. The empirical Rademacher complexity of F with respect to S is

    R̂_S(F) = E_σ [ sup_{f∈F} (1/m) Σ_{i=1}^m σ_i f(x_i) ],    (7)

where σ = [σ_1, ..., σ_m] is a random vector of i.i.d. random variables such that P(σ_i = 1) = P(σ_i = −1) = 0.5.
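Definition 3.2 can be turned into a direct Monte Carlo estimate for a finite function class; the sketch below (an illustrative example, not from the paper) averages the supremum over random sign vectors:

```python
import numpy as np

def empirical_rademacher(preds, n_trials=2000, seed=0):
    """Monte Carlo estimate of R_hat_S(F) for a finite class F (Eq. 7).
    preds: (k, m) array whose row j holds [f_j(x_1), ..., f_j(x_m)]."""
    rng = np.random.default_rng(seed)
    k, m = preds.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)  # i.i.d. Rademacher signs
        total += np.max(preds @ sigma) / m       # sup over f of (1/m) sum_i sigma_i f(x_i)
    return total / n_trials
```

For instance, for the class whose value profiles on the sample are {e_1, ..., e_m, −e_1, ..., −e_m}, the supremum equals 1/m for every sign draw, so the estimate is exactly 1/m; richer classes fit random signs better and get a larger ERC.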
Notably, a fundamental result in learning theory bounds the generalization error in terms of the ERC (Theorem 3.1).

Theorem 3.1 (Mohri et al. (2012)). Let L : Y × Y → [0, 1] be a loss function and F ⊆ Y^X a class of functions. Then, for any δ > 0, with probability at least 1 − δ over the choice of an m-sized sample S ∼ D^m from a distribution D over X × Y, the following holds for any f ∈ F:

    R_{S,L}(f) ≤ 2 R̂_S(F_L) + 3 √(log(2/δ) / (2m)),    (8)

where

    F_L = {(x, y) ↦ L(f(x), y) : f ∈ F}.
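As a toy numerical illustration (our construction, not an experiment from the paper), one can evaluate the right-hand side of Equation (8) for a small finite class of constant predictors under absolute-error loss and confirm that it dominates the observed gap; here the ERC of the loss class F_L is estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta = 2000, 0.05

# Constant predictors f_c(x) = c with loss L(y_hat, y) = |y_hat - y|;
# labels are uniform on {0, 1}, so the true risk is 0.5 for every c.
y = rng.integers(0, 2, size=m).astype(float)
cs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
loss_vals = np.abs(cs[:, None] - y[None, :])    # rows of F_L on the sample

# Monte Carlo estimate of the ERC of the loss class (Definition 3.2)
erc = np.mean([np.max(loss_vals @ rng.choice([-1.0, 1.0], m)) / m
               for _ in range(2000)])

gap = np.max(0.5 - loss_vals.mean(axis=1))      # largest true-minus-empirical gap
bound = 2 * erc + 3 * np.sqrt(np.log(2 / delta) / (2 * m))
assert gap <= bound                              # Theorem 3.1 holds on this draw
```

The confidence term alone is about 0.09 here, while the realized gap is on the order of the sampling noise of the labels, so the inequality holds with a wide margin.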
From Theorem 3.1, finding a generalization bound reduces to bounding the ERC. We will use standard tools for bounding the ERC, for which we need to introduce the concept of a covering number.

Lemma 3.1 (Bartlett et al. (2017)). Let F ⊆ [−β, β]^X be a class of functions taking values in [−β, β]. Also, assume that f₀ ∈ F, where f₀(x) = 0 for all x ∈ X. Define ∥f∥_∞ = sup_{x∈X} ∥f(x)∥₂. Then, for any set S = {x_i}_{i=1}^m ⊆ X,

    R̂_S(F) ≤ inf_{α>0} ( 4α/√m + (12/m) ∫_α^{2β√m} √(log N(F, r, ∥·∥_∞)) dr ).    (9)

We will analyze function classes parametrized by weight matrices, and the following bound on the covering number of sets of matrices will be useful.

Lemma 3.2 (Chen et al. (2020)). Let W = {W ∈ R^{d₁×d₂} : ∥W∥₂ ≤ λ} be a set of matrices with bounded spectral norm, and let r > 0 be a constant. The covering number of W can be bounded in terms of the Frobenius norm ∥·∥_F as

    N(W, r, ∥·∥_F) ≤ (1 + 2 min{√d₁, √d₂} λ / r)^{d₁ d₂}.    (10)

4. Main results

To derive our generalization bound, we make the following mild assumptions.

Assumption 4.1 (Inputs are bounded). The elements of the input graphs are contained in a Euclidean ball of radius β. More specifically, there exists a β ≥ 1 such that, for all graphs G = (V, E, C, Z) and all v ∈ V and (v, u) ∈ E with feature vector a_vu, we have

    max{∥c_v∥₂, ∥z_v − z_u∥₂, ∥a_uv∥₂} ≤ β.

Assumption 4.2 (EGNNs are parametrized with MLPs). For all ℓ = 0, ..., L_egnn, the functions ϕ_z^ℓ, ϕ_h^ℓ, and ϕ_μ^ℓ are MLPs with an identical number of layers, denoted by L_ϕ. The scoring model ϕ_out is also an MLP, with L_out layers. In addition, all activation functions ψ(·) are K_ψ-Lipschitz for some K_ψ ≥ 1, and ψ(0) = 0.

Assumption 4.3 (Weights have bounded spectral norm). Let W_i^ϕ denote the weight matrix of the i-th (linear) layer of the MLP ϕ. For all layers i of the MLPs ϕ ∈ {ϕ_z^ℓ}_{ℓ=0}^{L_egnn} ∪ {ϕ_h^ℓ}_{ℓ=0}^{L_egnn} ∪ {ϕ_μ^ℓ}_{ℓ=0}^{L_egnn} ∪ {ϕ_out}, there exists a β_{i,ϕ} ≥ 1 such that ∥W_i^ϕ∥₂ ≤ β_{i,ϕ}.
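Assumption 4.3 concerns the spectral norms ∥W_i^ϕ∥₂. As a side illustration (not from the paper), the spectral norm of a weight matrix can be checked numerically, either exactly via SVD or cheaply via power iteration:

```python
import numpy as np

def spectral_norm(W, n_iters=200, seed=0):
    # power iteration: v converges to the top right-singular vector of W,
    # so ||W v||_2 converges to the spectral norm ||W||_2
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)
```

For example, for W = diag(3, 1, 0.5) the result agrees with the exact value np.linalg.norm(W, 2) = 3.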
[Figure: roadmap of the analysis. Background results (a covering number bound for matrices) combine with our results (Lemma 4.1, and Lemma 4.2 showing that ε-normalization of f(·) preserves Lipschitz continuity) to establish that the scoring model is Lipschitz w.r.t. its parameters.]

[...] the generalization bounds for the EGNN model, namely [...] Lipschitz continuity of the EGNN node embeddings [...]. We first show that the node embeddings remain bounded as long as the inputs are bounded:

Lemma 4.1 (Boundedness of EGNN embeddings). Consider an EGNN model as described in Equations (1)-(4). For any graph G, we have that, for all v ∈ V and u ∈ J(v),

    max{∥h_v^{(ℓ)}∥₂, ∥z_v^{(ℓ)} − z_u^{(ℓ)}∥₂, D ∥μ_{u→v}^{(ℓ)}∥₂} ≤ C β^{(ℓ)},    (11)

where D is the maximum degree in the graph, C = 8Dβ, and β^{(ℓ)} = (20D)^ℓ (Π_{i=0}^ℓ M^{(i)})², with

    M^{(ℓ)} = max_{ϕ ∈ {ϕ_h^{ℓ−1}, ϕ_z^{ℓ−1}, ϕ_μ^ℓ}} Π_{i=1}^{L_ϕ} K_ψ β_{i,ϕ}.    (12)

[...] and ∥·∥ any norm. Then the ε-normalized function [...], where D, C, β^{(ℓ)}, and M^{(ℓ)} are as defined in Lemma 4.1, and

    Q = 224 D² max{C/ε, 1}³,    B^{(ℓ)} = Π_{i=0}^ℓ M^{(i)} Π_{i=0}^{ℓ−1} β^{(i)},    (14)

with the convention that an empty product is equal to 1. The proof is in Appendix E.

Finally, we can show that the scoring model defined in Equation (6) is also Lipschitz continuous w.r.t. its parameters.

Lemma 4.4 (Lipschitz continuity w.r.t. parameters of the [...]
[Table (comparison of bound dependencies, in Big-O notation):
Model | d (width) | W (weights) | L (depth)
MLPs (2018) | √(d log d) | Π_{i=1}^L ∥W_i∥₂ · √(Σ_{i=1}^L ∥W_i∥²_F / ∥W_i∥₂²) | √(L log L)
G-EGNNs (2022) | √(d log d) | Π_{i=1}^L ∥W_i∥₂ · √(Σ_{i=1}^L ∥W_i∥²_F / ∥W_i∥₂²) | √(L log L)]

4.2. Generalization bounds

We have developed all the necessary machinery to derive the [...]
layers learn more abstract, high-level information, which makes them generalize better, as opposed to top layers, which learn more fine-grained, task-specific information.

Second, the dependence on the spectral norms is logarithmic. This implies that the bound is less sensitive to outliers than, e.g., a sum of spectral norms. Furthermore, from the definition of κ (Equation 15), we see that weight matrices with spectral norms lower than 1 are ignored, implying that decreasing the norms further yields no additional gains.

Dependence on the training set. We note that the spectral norms of the weight matrices depend on the (size of the) training set, and in general one cannot decouple this dependence for an already trained network; e.g., we cannot assume the spectral norms would not change as we vary the number of examples while evaluating the effect of sample complexity (i.e., training set size) on generalization. The dependence of the spectral norm on the training set, learning algorithm, etc. was not considered in this study.

Spectral regularization. Multiple regularization methods relying on the spectral norm have been proposed, including the sum of squared spectral norms (Yoshida & Miyato, 2017) and explicitly enforcing the spectral norm to be 1 (Miyato et al., 2018). We propose a new regularization method that leverages the findings unique to our study, i.e., explicit minimization of the generalization gap:

    R(W) = Σ_{i,ϕ} w_{i,ϕ} log(κ(W_i^ϕ)).    (17)

Since log is concave, this can intuitively be considered an example of concave regularization (Zhang & Zhang, 2012), which generally favors sparse solutions. To see this, note that the derivative of log is 1/x, and therefore gradient-based optimization will prioritize decreasing the norms of the weight matrices whose norms are already the lowest. However, note that in our context, instead of log we use log ∘ κ, which is no longer concave on the entire domain. We empirically evaluate R in Section 6.
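In code, the proposed penalty can be sketched as follows. This is an illustrative implementation: we assume κ(W) = max{∥W∥₂, 1}, consistent with the remark above that spectral norms below 1 are ignored (Equation 15 is not reproduced in this excerpt), and the per-layer coefficients w_{i,ϕ} are left as user-supplied inputs.

```python
import numpy as np

def kappa(W):
    # spectral norm clipped below at 1 (assumed form of kappa, Eq. 15):
    # matrices with norm < 1 contribute nothing to the penalty
    return max(np.linalg.norm(W, 2), 1.0)

def log_kappa_penalty(weight_mats, coeffs=None):
    """R(W) = sum_i w_i * log(kappa(W_i)), cf. Equation (17).
    weight_mats: list of 2-D weight matrices; coeffs: per-matrix
    coefficients (uniform if None; the bound suggests layer-dependent ones)."""
    if coeffs is None:
        coeffs = [1.0] * len(weight_mats)
    return sum(w * np.log(kappa(W)) for w, W in zip(coeffs, weight_mats))
```

Compared to an average-of-spectral-norms penalty, the logarithm flattens the contribution of large norms, and the clipping at 1 removes any incentive to shrink weights whose spectral norm is already small.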
Impact of equivariance. Since the EGNN network without the equivariant update (Equation 2) reduces to a regular message-passing GNN, we can compare our bound to the Rademacher bound for message-passing GNNs (Garg et al., 2020). To make this comparison possible, we need to make the following simplifications to our setting: we assume that (1) each MLP has only a single hidden layer; (2) different GNN layers share the same weight matrices; and (3) there is no output MLP ϕ_out.

We include three variants of the MP-GNN bound, MP-GNN_a, MP-GNN_b, and MP-GNN_c, because Garg et al. (2020) report a different dependence on various parameters depending on the value of the percolation complexity, which in turn depends on the Lipschitz constants of the non-linearities and the spectral norm. See Table 2 for the comparison.

We observe that the EGNN bound enjoys a better dependence on the node degree D and the spectral norms of the weights W, and a possibly better (but certainly not worse) dependence on the width d. However, it has a worse dependence on the depth L. We leave further in-depth analysis of the impact of E(n)-equivariance for future work.

Table 2. Comparison with message-passing GNNs (MP-GNNs) (Garg et al., 2020). The a, b, c subscripts denote different model variants. All values are given in Big-O notation.

Model | d (width) | D (node degree) | W (weights) | L (depth)
MP-GNN_a | √d log d | √D log D | κ(W)³ √(log κ(W)) | log L
MP-GNN_b | √d log d | √D log D | κ(W)³ √(log κ(W)) | √L log L
MP-GNN_c | d √(d log d) | √D log D | κ(W)³ √(log κ(W)) | √L log L
EGNNs (ours) | √d log d | log D | √(log κ(W)) | L √(L log L)

6. Experiments

This section substantiates our theoretical analysis with experiments on real-world regression tasks. In particular, we first consider the generalization behavior of EGNNs as a function of the model variables (e.g., number of layers). We demonstrate that our bounds highly correlate with empirical generalization gaps. We also empirically assess the beneficial impact of ε-normalization on generalization, validating our findings. Lastly, we show the efficacy of the regularized EGNNs obtained from our analysis, highlighting performance gains compared to another regularization method and to empirical risk minimizers (no regularization). Our code is available at https://github.com/Aalto-QuML/GeneralizationEGNNs.

Evaluation setup. We consider four molecular property prediction tasks from the QM9 dataset (Ramakrishnan et al., 2014). From the data split available in (Satorras et al., 2021; Anderson et al., 2019), we select a subset of 2K molecules for training and use the entire validation/test partitions, with approximately 17K and 13K molecules, respectively. We train all models for 1000 epochs using the Adam optimizer and run five independent trials with different seeds. We provide further implementation details in Appendix J.

Empirical vs. theoretical generalization gaps. To demonstrate the practical relevance of the bounds extracted from our analysis, we contrast the empirical and theoretical generalization gaps in terms of: i) the spectral norms across epochs; ii) the maximum number of hidden units (hidden dim, d); and iii) the number of EGNN layers (L_egnn). We also report Pearson correlation coefficients between the generalization curves. Figure 3 shows the results (mean and standard deviation) over five runs.

[Figure 3. Empirical vs. theoretical generalization gaps as a function of the spectral norms over epochs (top), width (middle), and number of EGNN layers (bottom). Overall, our bounds (black curves) highly correlate with the empirical generalization gap (blue curves). The variables εHOMO, ∆ε, μ, and εLUMO denote molecular properties on the QM9 dataset.]

Regarding the spectral norm (top row in Figure 3), we observe that our bound can capture the trend of the empirical gap, which settles down at epoch 500 (εHOMO and μ) or 250 (∆ε and εLUMO). In 3 out of 4 tasks, the average correlation is greater than 0.8. We observe an even better correlation when looking at the dependence on the hidden dimensionality and the number of layers: in 7 out of 8 cases, the correlation is over 0.9. These outcomes support our theory.

[...] L_egnn ∈ {1, 3, 5, 7}. Figure 4 reports bar plots for each property. As our theory suggests, ε-normalization positively affects generalization, especially as we increase the number of layers.

[Figure 4. Impact of ε-normalization in terms of the number of layers. Using ε-normalization yields smaller generalization gaps as the depth (L_egnn) increases.]

Spectral norm regularization. To assess the effectiveness of our bound on the spectral norm as a regularizer (Equation 17), we compare it against a baseline, called SpecAvg, that takes the average of the spectral norms over all layers as the regularization term. Using the average (or sum) of spectral norms (or their squares) is a common practice (Yoshida & Miyato, 2017). We also report results without regularization. Again, we consider regression tasks from QM9. For both regularizers, we select the optimal penalization factor λ ∈ {1, 1e-3, 1e-5, 1e-7} using the validation set. Details are given in Appendix J.

Table 3 shows the results in terms of test error (MSE) and generalization gap for different numbers of layers and tasks. Remarkably, our method significantly outperforms SpecAvg in virtually all cases. Also, our approach was able to produce both smaller test errors and generalization gaps than the baselines. Overall, SpecAvg achieves better results than models without regularization.

Additional results. In Appendix K, we provide additional visualizations and experiments. More specifically, we show
Table 3. Test mean squared error (MSE) and generalization gap on QM9 for different regularization methods. ‘None’ denotes EGNNs
without regularization. We denote the best-performing methods in bold. In almost all cases, employing the method derived from our
theoretical analysis leads to the smallest test errors and generalization gaps. Results are multiplied by 100 for better visualization.
Task | L_egnn | Test MSE (None) | Test MSE (SpecAvg) | Test MSE (Ours) | Gap (None) | Gap (SpecAvg) | Gap (Ours)
εHOMO | 3 | 5.72 ± 0.48 | 5.43 ± 0.37 | 5.76 ± 0.51 | 3.61 ± 0.50 | 3.33 ± 0.49 | 3.71 ± 0.48
εHOMO | 5 | 9.66 ± 1.32 | 8.45 ± 1.28 | 6.71 ± 0.48 | 8.77 ± 1.40 | 7.29 ± 1.42 | 3.71 ± 1.56
εHOMO | 7 | 11.13 ± 0.73 | 9.25 ± 0.70 | 7.90 ± 0.50 | 10.45 ± 0.90 | 8.33 ± 1.02 | 4.43 ± 1.12
∆ε | 3 | 12.40 ± 1.34 | 12.36 ± 1.34 | 11.61 ± 1.09 | 8.02 ± 1.52 | 7.94 ± 1.46 | 6.00 ± 1.90
∆ε | 5 | 25.42 ± 7.53 | 25.38 ± 7.77 | 17.39 ± 1.37 | 22.92 ± 6.87 | 22.91 ± 7.04 | 10.22 ± 5.17
∆ε | 7 | 28.49 ± 4.35 | 26.28 ± 5.00 | 23.93 ± 2.73 | 27.46 ± 3.83 | 25.14 ± 4.34 | 18.16 ± 7.39
μ | 3 | 30.56 ± 1.18 | 30.03 ± 0.98 | 29.56 ± 0.47 | 8.11 ± 1.96 | 7.78 ± 1.47 | 5.76 ± 1.76
μ | 5 | 30.94 ± 1.95 | 30.48 ± 2.26 | 28.43 ± 0.26 | 12.91 ± 2.27 | 11.80 ± 1.38 | 8.47 ± 2.85
μ | 7 | 31.73 ± 2.46 | 31.35 ± 2.59 | 28.86 ± 0.73 | 18.42 ± 1.71 | 17.40 ± 1.66 | 9.79 ± 4.45
εLUMO | 3 | 5.68 ± 0.45 | 5.68 ± 0.45 | 5.11 ± 0.36 | 3.07 ± 0.53 | 3.07 ± 0.53 | 2.46 ± 0.41
εLUMO | 5 | 7.51 ± 1.76 | 6.74 ± 1.07 | 5.85 ± 0.56 | 6.01 ± 1.68 | 5.17 ± 0.90 | 3.75 ± 1.07
εLUMO | 7 | 7.87 ± 1.03 | 7.45 ± 0.73 | 7.16 ± 0.90 | 7.03 ± 1.17 | 6.71 ± 0.88 | 5.58 ± 2.29
the behavior of the generalization gaps over training to illustrate that the gaps from our method decrease with the value of λ, as expected. In contrast, SpecAvg has little effect in reducing the gap compared to non-regularized models for most penalization values. We also compare our regularizer with SpecAvg on eight other QM9 regression tasks. Overall, our method achieves smaller generalization gaps and competitive MSE values. Lastly, we report the average time per epoch of our regularizer for different tasks to show its computational overhead. For the largest model, the regularized vs. non-regularized time ratio is approximately 1.5.

7. Conclusion

In this work, we analyzed the generalization capabilities of E(n)-Equivariant Graph Neural Networks (EGNNs) from a theoretical perspective. We provided high-probability bounds on the generalization error and discussed their implications in detail: specifically, the logarithmic dependence on the spectral norms of the weight matrices, the uneven impact of different layers, and the importance of ε-normalization. We performed extensive experiments to validate our theory.

Additionally, we proposed a novel regularization method inspired by our theoretical results. We showed experimentally that it helps reduce both the test loss and the generalization gap, and that it performs better than the commonly used sum of spectral norms of the weight matrices.

Acknowledgements

This work has been supported by the Jane and Aatos Erkko Foundation project (grant 7001703) on "Biodesign: Use of artificial intelligence in enzyme design for synthetic biology", and the Finnish Center for Artificial Intelligence FCAI (Flagship program). CSC – IT Center for Science, Finland, provided computational support for this work. We thank the anonymous reviewers for their constructive feedback.

Impact Statement

Incorporating symmetries such as equivariance in deep learning models is crucial for advancing fields like physics-inspired learning, drug and material discovery, protein design, and molecular dynamics simulation. Good generalization performance is a key goal of machine learning models. This work investigates the generalization of equivariant graph neural networks, providing several insights that we hope will foster the design of better models. We do not foresee any particularly negative societal impact of this work.

References

Anderson, B., Hy, T. S., and Kondor, R. Cormorant: Covariant molecular neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Arts, M., Garcia Satorras, V., Huang, C.-W., Zugner, D., Federici, M., Clementi, C., Noé, F., Pinsler, R., and van den Berg, R. Two for one: Diffusion models and force fields for coarse-grained molecular dynamics. Journal of Chemical Theory and Computation, 19(18):6151–6159, 2023.

Barceló, P., Kostylev, E. V., Monet, M., Pérez, J., Reutter, J., and Silva, J.-P. The logical expressiveness of graph neural networks. In 8th International Conference on Learning Representations (ICLR 2020), 2020.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-
normalized margin bounds for neural networks. In Guyon, Fu, T., Gao, W., Coley, C., and Sun, J. Reinforced genetic
I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., algorithm for structure-based drug design. Advances
Vishwanathan, S., and Garnett, R. (eds.), Advances in in Neural Information Processing Systems, 35:12325–
Neural Information Processing Systems, volume 30. Cur- 12338, 2022.
ran Associates, Inc., 2017.
Fuchs, F., Worrall, D., Fischer, V., and Welling, M. Se
Behboodi, A., Cesa, G., and Cohen, T. S. A pac-bayesian (3)-transformers: 3d roto-translation equivariant attention
generalization bound for equivariant networks. Advances networks. Advances in neural information processing
in Neural Information Processing Systems, 35:5654– systems, 33:1970–1981, 2020.
5668, 2022.
Fukushima, K. Neocognitron: A self-organizing neural
Bogatskiy, A., Anderson, B., Offermann, J., Roussi, M., network model for a mechanism of pattern recognition
Miller, D., and Kondor, R. Lorentz group equivariant unaffected by shift in position. Biological Cybernetics,
neural network for particle physics. In International 36(4), 1980.
Conference on Machine Learning, pp. 992–1002, 2020.
Garcia Satorras, V., Hoogeboom, E., Fuchs, F., Posner, I.,
Bronstein, M. M., Bruna, J., Cohen, T., and Velickovic, and Welling, M. E(n) equivariant normalizing flows.
P. Geometric deep learning: Grids, groups, graphs, In Advances in Neural Information Processing Systems,
geodesics, and gauges. ArXiv e-prints: 2104.13478, volume 34, pp. 4181–4192. Curran Associates, Inc., 2021.
2021.
Garg, V. K., Jegelka, S., and Jaakkola, T. Generalization
Bulusu, S., Favoni, M., Ipp, A., Müller, D. I., and Schuh, D.
and representational limits of graph neural networks. In
Generalization capabilities of translationally equivariant
International Conference on Machine Learning, 2020.
neural networks. Physical Review D, 104(7):074504,
2021. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and
Dahl, G. E. Neural message passing for quantum chem-
Chen, M., Li, X., and Zhao, T. On generalization bounds of
istry. In International Conference on Machine Learning
a family of recurrent neural networks. In Proceedings of
(ICML), 2017.
the Twenty Third International Conference on Artificial
Intelligence and Statistics, pp. 1233–1243, 2020.
Gori, M., Monfardini, G., and Scarselli, F. A new model for
Cohen, T. and Welling, M. Group equivariant convolutional learning in graph domains. In IEEE International Joint
networks. In International conference on machine learn- Conference on Neural Networks (IJCNN), 2005.
ing, pp. 2990–2999. PMLR, 2016.
Joshi, C. K., Bodnar, C., Mathis, S. V., Cohen, T., and
Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spher- Liò, P. On the expressive power of geometric graph
ical cnns. In International Conference on Learning Rep- neural networks. In International Conference on Machine
resentations (ICLR), 2018. Learning, 2023.
Derrow-Pinion, A., She, J., Wong, D., Lange, O., Hester, Kipf, T. N. and Welling, M. Semi-supervised classifica-
T., Perez, L., Nunkesser, M., Lee, S., Guo, X., Wiltshire, tion with graph convolutional networks. In International
B., Battaglia, P. W., Gupta, V., Li, A., Xu, Z., Sanchez- Conference on Learning Representations (ICLR), 2017.
Gonzalez, A., Li, Y., and Velickovic, P. Eta prediction
with graph neural networks in google maps. In Pro- Kondor, R. and Trivedi, S. On the generalization of equivari-
ceedings of the 30th ACM International Conference on ance and convolution in neural networks to the action of
Information & Knowledge Management, pp. 3767–3776, compact groups. In International Conference on Machine
2021. Learning, pp. 2747–2755, 2018.
Elesedy, B. and Zaidi, S. Provably strict generalisation ben- Lang, L. and Weiler, M. A wigner-eckart theorem for group
efit for equivariant models. In International Conference equivariant convolution kernels. In International Confer-
on Machine Learning, pp. 2959–2969. PMLR, 2021. ence on Learning Representations, 2021.
Finzi, M., Stanton, S., Izmailov, P., and Wilson, A. G. Gen- LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard,
eralizing convolutional neural networks for equivariance R. E., Hubbard, W., and Jackel, L. D. Handwritten digit
to lie groups on arbitrary continuous data. In Interna- recognition with a back-propagation network. In Ad-
tional Conference on Machine Learning, pp. 3165–3176, vances in Neural Information Processing Systems (NIPS),
2020. 1990.
10
On the Generalization of Equivariant Graph Neural Networks
Liu, Y., Cheng, J., Zhao, H., Xu, T., Zhao, P., Tsung, F., Li, J., and Rong, Y. Improving generalization in equivariant graph neural networks with physical inductive biases. In The Twelfth International Conference on Learning Representations, 2023.

Loukas, A. What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations, 2020.

Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. The MIT Press, 2012. ISBN 9780262018258. URL http://www.jstor.org/stable/j.ctt5hhcw1.

Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence (AAAI), 2019.

Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.

Oono, K. and Suzuki, T. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems (NeurIPS Workshop), 2017.

Petrache, M. and Trivedi, S. Approximation-generalization trade-offs under (approximate) group equivariance, 2023.

Ramakrishnan, R., Dral, P. O., Rupp, M., and von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1), 2014.

Sanchez-Gonzalez, A., Godwin, J., Pfaff, T., Ying, R., Leskovec, J., and Battaglia, P. Learning to simulate complex physics with graph networks. In International Conference on Machine Learning (ICML), 2020.

Sannai, A., Imaizumi, M., and Kawano, M. Improved generalization bounds of group invariant/equivariant deep networks via quotient feature spaces. In Uncertainty in Artificial Intelligence, pp. 771–780, 2021.

Sato, R., Yamada, M., and Kashima, H. Approximation ratios of graph neural networks for combinatorial problems. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Satorras, V. G., Hoogeboom, E., and Welling, M. E(n) equivariant graph neural networks. In International Conference on Machine Learning, 2021.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Stärk, H., Ganea, O., Pattanaik, L., Barzilay, D., and Jaakkola, T. EquiBind: Geometric deep learning for drug binding structure prediction. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 20503–20521, 2022.

Stokes, J. M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N. M., MacNair, C. R., French, S., Carfrae, L. A., Bloom-Ackermann, Z., Tran, V. M., Chiappino-Pepe, A., Badran, A. H., Andrews, I. W., Chory, E. J., Church, G. M., Brown, E. D., Jaakkola, T. S., Barzilay, R., and Collins, J. J. A deep learning approach to antibiotic discovery. Cell, 180(4):688–702, 2020.

Tang, H. and Liu, Y. Towards understanding generalization of graph neural networks. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 33674–33719, 2023.

Thomas, N., Smidt, T., Kearnes, S., Yang, L., Li, L., Kohlhoff, K., and Riley, P. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018.

Valiant, L. G. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

Verma, Y., Heinonen, M., and Garg, V. ClimODE: Climate and weather forecasting with physics-informed neural ODEs. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=xuY33XhEGR.

Wang, R., Walters, R., and Yu, R. Approximately equivariant networks for imperfectly symmetric dynamics. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 23078–23091. PMLR, 2022.
Wang, Y., Wang, T., Li, S., He, X., Li, M., Wang, Z., Zheng,
N., Shao, B., and Liu, T.-Y. Enhancing geometric repre-
sentations for molecules with equivariant vector-scalar
interactive message passing. Nature Communications, 15
(1):313, 2024.
Weiler, M. and Cesa, G. General E(2)-equivariant steerable CNNs. Advances in Neural Information Processing Systems, 32, 2019.
Weisfeiler, B. and Lehman, A. A. A reduction of a graph
to a canonical form and an algebra arising during this
reduction. Nauchno-Technicheskaya Informatsia, 2(9):
12–16, 1968.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful
are graph neural networks? In International Conference
on Learning Representations (ICLR), 2019.
Xu, Y. and Wang, X. Understanding weight normalized deep
neural networks with rectified linear units. In Bengio, S.,
Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi,
N., and Garnett, R. (eds.), Advances in Neural Informa-
tion Processing Systems, volume 31. Curran Associates,
Inc., 2018.
Yang, C., Wu, Q., Wang, J., and Yan, J. Graph neural
networks are inherently good generalizers: Insights by
bridging gnns and mlps. In International Conference on
Learning Representations (ICLR), 2023.
Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton,
W. L., and Leskovec, J. Graph convolutional neural
networks for web-scale recommender systems. In In-
ternational Conference on Knowledge Discovery & Data
Mining (KDD), 2018.
Yoshida, Y. and Miyato, T. Spectral norm regularization for
improving the generalizability of deep learning. ArXiv,
abs/1705.10941, 2017.
Zhang, C.-H. and Zhang, T. A general theory of concave reg-
ularization for high-dimensional sparse estimation prob-
lems. Statistical Science, 27(4):576–593, 2012.
A. Notation
Table 4: Summary of notation and abbreviations.
Notation                                   Description
L                                          Loss function
G = (V, E, C, Z)                           Geometric graph
V                                          Set of nodes in the graph
E ⊆ V²                                     Neighbourhood structure
C                                          Set of node features
Z                                          Set of node coordinates
f ∈ F                                      f is a function from the function class F
S, m                                       S is a training set of m input-output samples, i.e., S = {(x_i, y_i)}_{i=1}^m
ϕ = f_{W_{L_ϕ}^ϕ} ∘ ⋯ ∘ f_{W_1^ϕ}          An L_ϕ-layer MLP
ϕ_k = f_{W_k^ϕ} ∘ ⋯ ∘ f_{W_1^ϕ}            The first k layers of ϕ
ϕ(x) = ϕ(x; W_ϕ), W_ϕ = {W_i^ϕ}_{i=1}^{L_ϕ}   Notation used whenever we want to emphasize which weight matrices were used
One can augment the input space by appending a constant 1 to x to obtain x̃ and redefining the weight matrix as W̃ = concat(W, b). Thus the layer can be expressed as f_W(x) = ψ(W̃x̃), so we may assume bias-free layers without loss of generality.

Proof.
∥ϕ(x)∥₂ = ∥ϕ_{L_ϕ}(x)∥₂ = ∥ψ(W_{L_ϕ}^ϕ ϕ_{L_ϕ−1}(x))∥₂ = ∥ψ(W_{L_ϕ}^ϕ ϕ_{L_ϕ−1}(x)) − ψ(0)∥₂ ≤ K_ψ ∥W_{L_ϕ}^ϕ ϕ_{L_ϕ−1}(x)∥₂
≤ K_ψ ∥W_{L_ϕ}^ϕ∥₂ ∥ϕ_{L_ϕ−1}(x)∥₂ ≤ ⋯ ≤ ∥x∥₂ ∏_{i=1}^{L_ϕ} K_ψ ∥W_i^ϕ∥₂.
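The output-norm bound just proved can be checked numerically. Below is a minimal sketch (our own illustration, not from the paper) using a bias-free ReLU MLP, for which K_ψ = 1 and ψ(0) = 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # ReLU is 1-Lipschitz and satisfies psi(0) = 0, as the lemma assumes.
    return np.maximum(x, 0.0)

def mlp(x, weights):
    # Bias-free MLP: phi = f_{W_L} o ... o f_{W_1} with f_W(x) = psi(W x).
    for W in weights:
        x = relu(W @ x)
    return x

dims = [8, 16, 16, 4]
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
x = rng.normal(size=dims[0])

lhs = np.linalg.norm(mlp(x, weights))
# With K_psi = 1, the bound is ||x||_2 * prod_i ||W_i^phi||_2 (spectral norms).
rhs = np.linalg.norm(x) * np.prod([np.linalg.norm(W, 2) for W in weights])
assert lhs <= rhs
```

Here `np.linalg.norm(W, 2)` returns the spectral norm of a matrix, matching the ∥·∥₂ used throughout the proofs.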
Lemma B.2 (Lipschitz continuity of MLP w.r.t. model parameters). Consider two L_ϕ-layer MLPs given by two sets of parameters, W_ϕ and W̃_ϕ, sharing the same activation function ψ with Lipschitz constant K_ψ and ψ(0) = 0. Assume further that all weight matrices have bounded norm, i.e. for all i, max{∥W_i^ϕ∥₂, ∥W̃_i^ϕ∥₂} ≤ β_{i,ϕ} for some β_{i,ϕ} ≥ 1. Then for all x, W_ϕ, W̃_ϕ:

∥ϕ(x; W_ϕ) − ϕ(x; W̃_ϕ)∥₂ ≤ ∥x∥₂ (∏_{i=1}^{L_ϕ} K_ψ β_{i,ϕ}) dist(W_ϕ, W̃_ϕ),   (18)

where

dist(W_ϕ, W̃_ϕ) := Σ_{i=1}^{L_ϕ} ∥W_i^ϕ − W̃_i^ϕ∥₂.
Proof. For presentation clarity, throughout this proof we write ∥·∥ ≡ ∥·∥₂, i.e. the spectral norm for matrices and the Euclidean norm for vectors.

∥ϕ(x; W_ϕ) − ϕ(x; W̃_ϕ)∥ = ∥ψ(W_{L_ϕ}^ϕ ϕ_{L_ϕ−1}(x; W_ϕ)) − ψ(W̃_{L_ϕ}^ϕ ϕ_{L_ϕ−1}(x; W̃_ϕ))∥
≤ K_ψ ∥W_{L_ϕ}^ϕ ϕ_{L_ϕ−1}(x; W_ϕ) − W̃_{L_ϕ}^ϕ ϕ_{L_ϕ−1}(x; W̃_ϕ)∥.
Therefore

∥ϕ(x; W_ϕ) − ϕ(x; W̃_ϕ)∥ ≤ ∥x∥ K_ψ^{L_ϕ} (∏_{i=1}^{L_ϕ−1} ∥W̃_i^ϕ∥) ∥W_{L_ϕ}^ϕ − W̃_{L_ϕ}^ϕ∥ + K_ψ ∥W_{L_ϕ}^ϕ∥ · ∥ϕ_{L_ϕ−1}(x; W_ϕ) − ϕ_{L_ϕ−1}(x; W̃_ϕ)∥.

Unrolling this recursion over all layers yields

∥ϕ(x; W_ϕ) − ϕ(x; W̃_ϕ)∥ ≤ ∥x∥ K_ψ^{L_ϕ} Σ_{i=1}^{L_ϕ} a_i ∥W_i^ϕ − W̃_i^ϕ∥,

where

a_i = ∏_{j=1}^{i−1} ∥W̃_j^ϕ∥ ∏_{j′=i+1}^{L_ϕ} ∥W_{j′}^ϕ∥ ≤ ∏_{j=1}^{L_ϕ} β_{j,ϕ}.

Since K_ψ^{L_ϕ} a_i ≤ ∏_{j=1}^{L_ϕ} K_ψ β_{j,ϕ}, summing over i gives (18).
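Lemma B.2 can likewise be verified empirically on random weight perturbations; the sketch below is our own illustration, using ReLU (so K_ψ = 1) and taking β_{i,ϕ} = max{∥W_i^ϕ∥₂, ∥W̃_i^ϕ∥₂, 1} to satisfy the lemma's assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

def mlp(x, weights):
    for W in weights:
        x = relu(W @ x)
    return x

dims = [8, 16, 16, 4]
W = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
W_tilde = [Wi + 0.01 * rng.normal(size=Wi.shape) for Wi in W]  # perturbed copy
x = rng.normal(size=dims[0])

lhs = np.linalg.norm(mlp(x, W) - mlp(x, W_tilde))
# beta_i >= 1 must bound both ||W_i||_2 and ||W~_i||_2, as the lemma requires.
betas = [max(np.linalg.norm(a, 2), np.linalg.norm(b, 2), 1.0)
         for a, b in zip(W, W_tilde)]
dist = sum(np.linalg.norm(a - b, 2) for a, b in zip(W, W_tilde))
rhs = np.linalg.norm(x) * np.prod(betas) * dist  # right-hand side of (18)
assert lhs <= rhs
```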
Lemma B.3 (Lipschitz continuity of MLP w.r.t. model input). Let ϕ be an L_ϕ-layer MLP with a K_ψ-Lipschitz activation function ψ. Then ϕ is Lipschitz continuous with respect to its input:

∥ϕ(x) − ϕ(y)∥₂ ≤ (∏_{i=1}^{L_ϕ} K_ψ ∥W_i^ϕ∥₂) ∥x − y∥₂.

Proof.

∥ϕ(x) − ϕ(y)∥ = ∥ϕ_{L_ϕ}(x) − ϕ_{L_ϕ}(y)∥ ≤ K_ψ ∥W_{L_ϕ}^ϕ∥ · ∥ϕ_{L_ϕ−1}(x) − ϕ_{L_ϕ−1}(y)∥ ≤ ⋯ ≤ (∏_{i=1}^{L_ϕ} K_ψ ∥W_i^ϕ∥) ∥x − y∥.
Lemma B.4 (Multiple-input MLP). Suppose ϕ, an L_ϕ-layer MLP, takes multiple inputs x₁, …, x_K by concatenating them into a single vector x = CONCAT([x₁, …, x_K]). Then the output is bounded:

∥ϕ(x₁, …, x_K)∥₂ ≤ √K (∏_{i=1}^{L_ϕ} K_ψ ∥W_i^ϕ∥₂) Σ_{i=1}^{K} ∥x_i∥₂.
where D is the maximum degree in the graph, C = 8Dβ, and β^{(ℓ)} = (20D)^ℓ ∏_{i=0}^{ℓ} (M^{(i)})², with

M^{(ℓ)} = max_{ϕ ∈ {ϕ_h^{ℓ−1}, ϕ_z^{ℓ−1}, ϕ_µ^ℓ}} ∏_{i=1}^{L_ϕ} K_ψ β_{i,ϕ},   (12)

where β_{i,ϕ} is the bound on ∥W_i^ϕ∥₂, L_ϕ is the number of layers of ϕ, and K_ψ is the Lipschitz constant of the activation function.
Proof. In the proof we assume ∥·∥ ≡ ∥·∥₂. First note that (11) implies that for all v and ℓ, ∥µ_v^{(ℓ)}∥ ≤ Cβ^{(ℓ)}, due to the definition of µ_v^{(ℓ)} as a sum aggregation of µ_{u→v}^{(ℓ)}. We will show the result by induction. Note that, from Lemmas B.3 and B.4:

∥µ_{u→v}^{(0)}∥ = ∥ϕ_µ^0(h_u^{(0)}, h_v^{(0)}, ∥z_u^{(0)} − z_v^{(0)}∥, a_uv)∥ ≤ 2(2β + β + β) M^{(0)} ≤ (1/D) C β^{(0)},

thus establishing the base case ℓ = 0. Now assume that (11) holds for all ℓ′ < ℓ. From Lemmas B.3, B.4 and the inductive hypothesis:

∥h_v^{(ℓ)}∥ ≤ √2 M^{(ℓ)} (∥h_v^{(ℓ−1)}∥ + ∥µ_v^{(ℓ−1)}∥) ≤ 2√2 M^{(ℓ)} C β^{(ℓ−1)} ≤ C β^{(ℓ)},

where the last inequality holds because 2√2 ≤ 20D and M^{(ℓ)} ≤ (M^{(ℓ)})². Similarly for z, and therefore:

∥z_u^{(ℓ)} − z_v^{(ℓ)}∥ ≤ C β^{(ℓ−1)} + 2 M^{(ℓ)} C β^{(ℓ−1)} ≤ 3 M^{(ℓ)} C β^{(ℓ−1)} ≤ C β^{(ℓ)},

because 3 ≤ 20D. Finally for µ:

∥µ_{u→v}^{(ℓ)}∥ ≤ 2 M^{(ℓ)} (∥h_v^{(ℓ)}∥ + ∥h_u^{(ℓ)}∥ + ∥z_v^{(ℓ)} − z_u^{(ℓ)}∥ + ∥a_uv∥)
≤ 2 M^{(ℓ)} (4√2 M^{(ℓ)} + 1 + 2 M^{(ℓ)} + 1) C β^{(ℓ−1)}
≤ 20 (M^{(ℓ)})² C β^{(ℓ−1)} = (1/D) C β^{(ℓ)},

where the last inequality holds because 4 + 4√2 ≤ 10.
Proof.

∥f_ε(x) − f_ε(y)∥ = ∥ f(x)/(∥f(x)∥ + ε) − f(y)/(∥f(y)∥ + ε) ∥ = ∥(∥f(y)∥ + ε)f(x) − (∥f(x)∥ + ε)f(y)∥ / ((∥f(x)∥ + ε)(∥f(y)∥ + ε))
≤ ε∥f(x) − f(y)∥ / ((∥f(x)∥ + ε)(∥f(y)∥ + ε)) + ∥∥f(y)∥f(x) − ∥f(x)∥f(y)∥ / ((∥f(x)∥ + ε)(∥f(y)∥ + ε))
≤ (L_f/ε)∥x − y∥ + ∥∥f(y)∥f(x) − ∥f(x)∥f(y)∥ / ((∥f(x)∥ + ε)(∥f(y)∥ + ε)).

Focusing on the numerator of the last term:

∥∥f(y)∥f(x) − ∥f(x)∥f(y)∥ = ∥∥f(y)∥f(x) − ∥f(x)∥f(x) + ∥f(x)∥f(x) − ∥f(x)∥f(y)∥
≤ |∥f(y)∥ − ∥f(x)∥| · ∥f(x)∥ + ∥f(x)∥ · ∥f(x) − f(y)∥
≤ ∥f(x)∥ (L_f∥x − y∥ + |∥f(x)∥ − ∥f(y)∥|)
≤ 2∥f(x)∥ L_f ∥x − y∥,

where in the last step we used the fact that |∥a∥ − ∥b∥| ≤ ∥a − b∥. Therefore:

∥∥f(y)∥f(x) − ∥f(x)∥f(y)∥ / ((∥f(x)∥ + ε)(∥f(y)∥ + ε)) ≤ 2∥f(x)∥L_f∥x − y∥ / ((∥f(x)∥ + ε)(∥f(y)∥ + ε)) ≤ (2L_f/ε)∥x − y∥

and

∥f_ε(x) − f_ε(y)∥ ≤ (3L_f/ε)∥x − y∥.
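The 3L_f/ε constant is easy to probe numerically. A small sketch (ours, not from the paper) with a linear map f(x) = Ax, whose Lipschitz constant is the spectral norm ∥A∥₂:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
L_f = np.linalg.norm(A, 2)  # Lipschitz constant of the linear map f(x) = A x
eps = 0.5

def f_eps(x):
    # epsilon-normalization: f_eps(x) = f(x) / (||f(x)|| + eps)
    fx = A @ x
    return fx / (np.linalg.norm(fx) + eps)

ratios = []
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    gap = np.linalg.norm(f_eps(x) - f_eps(y))
    ratios.append(gap / np.linalg.norm(x - y))

# The observed Lipschitz ratio never exceeds the lemma's bound 3 * L_f / eps.
assert max(ratios) <= 3 * L_f / eps + 1e-9
```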
where

dist_ℓ(W, W̃) = Σ_{ℓ′=0}^{ℓ} Σ_{ϕ∈{ϕ_h^{ℓ′−1}, ϕ_z^{ℓ′−1}, ϕ_µ^{ℓ′}}} Σ_{i=1}^{L_ϕ} ∥W_i^ϕ − W̃_i^ϕ∥₂.

Proof. Throughout the proof we assume ∥·∥ ≡ ∥·∥₂. We will show the result by induction; in fact, we will show a stronger statement, namely that all h_v^{(ℓ)}, z_v^{(ℓ)} and µ_{u→v}^{(ℓ)} are Lipschitz continuous. The base case obviously holds for h_v^{(0)} and z_v^{(0)}, as they do not depend on the model parameters.
We see from Lemma B.2 and our assumption that the inputs are bounded:

∥µ_{u→v}^{(0)}(W) − µ_{u→v}^{(0)}(W̃)∥ ≤ M^{(0)} 8β dist(W_{ϕ_µ^0}, W̃_{ϕ_µ^0}) ≤ (1/D) C M^{(0)} dist(W_{ϕ_µ^0}, W̃_{ϕ_µ^0}) ≤ (1/D) C M^{(0)} dist₀(W, W̃),   (20)

thus establishing the base case for the induction. Assume now that for all ℓ′ < ℓ there exist K_h^{(ℓ′)}, K_z^{(ℓ′)}, K_µ^{(ℓ′)} such that for all u, v and all W, W̃:

∥h_v^{(ℓ′)}(W) − h_v^{(ℓ′)}(W̃)∥ ≤ K_h^{(ℓ′)} dist_{ℓ′}(W, W̃)
∥z_v^{(ℓ′)}(W) − z_v^{(ℓ′)}(W̃)∥ ≤ K_z^{(ℓ′)} dist_{ℓ′}(W, W̃)   (21)
∥µ_{u→v}^{(ℓ′)}(W) − µ_{u→v}^{(ℓ′)}(W̃)∥ ≤ (1/D) K_µ^{(ℓ′)} dist_{ℓ′}(W, W̃).

Note that (21) implies that ∥µ_v^{(ℓ′)}(W) − µ_v^{(ℓ′)}(W̃)∥ ≤ K_µ^{(ℓ′)} dist_{ℓ′}(W, W̃). We start with h and proceed with a telescoping argument:

∥h_v^{(ℓ)}(W) − h_v^{(ℓ)}(W̃)∥
= ∥ϕ_h^{ℓ−1}(h_v^{(ℓ−1)}(W), µ_v^{(ℓ−1)}(W); W_{ϕ_h^{ℓ−1}}) − ϕ_h^{ℓ−1}(h_v^{(ℓ−1)}(W̃), µ_v^{(ℓ−1)}(W̃); W̃_{ϕ_h^{ℓ−1}})∥
≤ ∥ϕ_h^{ℓ−1}(h_v^{(ℓ−1)}(W), µ_v^{(ℓ−1)}(W); W_{ϕ_h^{ℓ−1}}) − ϕ_h^{ℓ−1}(h_v^{(ℓ−1)}(W), µ_v^{(ℓ−1)}(W); W̃_{ϕ_h^{ℓ−1}})∥
+ ∥ϕ_h^{ℓ−1}(h_v^{(ℓ−1)}(W), µ_v^{(ℓ−1)}(W); W̃_{ϕ_h^{ℓ−1}}) − ϕ_h^{ℓ−1}(h_v^{(ℓ−1)}(W̃), µ_v^{(ℓ−1)}(W̃); W̃_{ϕ_h^{ℓ−1}})∥.

Now we use Lemmas B.2, B.4 and 4.1 to bound the first term and Lemma B.3 to bound the second:

∥h_v^{(ℓ)}(W) − h_v^{(ℓ)}(W̃)∥
≤ √2 M^{(ℓ)} (∥h_v^{(ℓ−1)}(W)∥ + ∥µ_v^{(ℓ−1)}(W)∥) dist(W_{ϕ_h^{ℓ−1}}, W̃_{ϕ_h^{ℓ−1}})
+ √2 M^{(ℓ)} (∥h_v^{(ℓ−1)}(W) − h_v^{(ℓ−1)}(W̃)∥ + ∥µ_v^{(ℓ−1)}(W) − µ_v^{(ℓ−1)}(W̃)∥)   (22)
≤ 2√2 M^{(ℓ)} C β^{(ℓ−1)} dist(W_{ϕ_h^{ℓ−1}}, W̃_{ϕ_h^{ℓ−1}}) + √2 M^{(ℓ)} (K_h^{(ℓ−1)} + K_µ^{(ℓ−1)}) dist_{ℓ−1}(W, W̃)
≤ √2 M^{(ℓ)} (2Cβ^{(ℓ−1)} + K_h^{(ℓ−1)} + K_µ^{(ℓ−1)}) dist_ℓ(W, W̃).

Moving on to z:
By Lemma 4.2, (z_v^{(ℓ−1)} − z_u^{(ℓ−1)}) / (∥z_v^{(ℓ−1)} − z_u^{(ℓ−1)}∥ + ε) is Lipschitz continuous with constant 6K_z^{(ℓ−1)}/ε, and consequently:

∥z_v^{(ℓ)}(W) − z_v^{(ℓ)}(W̃)∥ ≤ (K_z^{(ℓ−1)} + M^{(ℓ)} (6K_z^{(ℓ−1)}/ε) ∥µ_{u→v}^{(ℓ−1)}(W̃)∥) dist_{ℓ−1}(W, W̃)
+ (1/|J(v)|) Σ_{u∈J(v)} ∥ϕ_z^{ℓ−1}(µ_{u→v}^{(ℓ−1)}(W); W_{ϕ_z^{ℓ−1}}) − ϕ_z^{ℓ−1}(µ_{u→v}^{(ℓ−1)}(W̃); W̃_{ϕ_z^{ℓ−1}})∥.   (23)

By the same telescoping argument as for h,

∥ϕ_z^{ℓ−1}(µ_{u→v}^{(ℓ−1)}(W); W_{ϕ_z^{ℓ−1}}) − ϕ_z^{ℓ−1}(µ_{u→v}^{(ℓ−1)}(W̃); W̃_{ϕ_z^{ℓ−1}})∥ ≤ M^{(ℓ)} ∥µ_{u→v}^{(ℓ−1)}(W)∥ dist(W_{ϕ_z^{ℓ−1}}, W̃_{ϕ_z^{ℓ−1}}) + M^{(ℓ)} ∥µ_{u→v}^{(ℓ−1)}(W) − µ_{u→v}^{(ℓ−1)}(W̃)∥.   (24)

Now from Lemma 4.1 we have ∥µ_{u→v}^{(ℓ−1)}(W)∥ ≤ Cβ^{(ℓ−1)}, and by applying (24) to (23) we obtain:

∥z_v^{(ℓ)}(W) − z_v^{(ℓ)}(W̃)∥ ≤ (K_z^{(ℓ−1)} + M^{(ℓ)} (6K_z^{(ℓ−1)}/ε) Cβ^{(ℓ−1)} + M^{(ℓ)} Cβ^{(ℓ−1)} + M^{(ℓ)} K_µ^{(ℓ−1)}) dist_ℓ(W, W̃)
≤ M^{(ℓ)} ((6K_z^{(ℓ−1)}/ε) Cβ^{(ℓ−1)} + K_µ^{(ℓ−1)} + K_z^{(ℓ−1)} + Cβ^{(ℓ−1)}) dist_ℓ(W, W̃).
Next,

∥µ_{u→v}^{(ℓ)}(W) − µ_{u→v}^{(ℓ)}(W̃)∥ = ∥ϕ_µ^ℓ(ω_uv^{(ℓ)}(W); W_{ϕ_µ^ℓ}) − ϕ_µ^ℓ(ω_uv^{(ℓ)}(W̃); W̃_{ϕ_µ^ℓ})∥
≤ ∥ϕ_µ^ℓ(ω_uv^{(ℓ)}(W); W_{ϕ_µ^ℓ}) − ϕ_µ^ℓ(ω_uv^{(ℓ)}(W); W̃_{ϕ_µ^ℓ})∥ + ∥ϕ_µ^ℓ(ω_uv^{(ℓ)}(W); W̃_{ϕ_µ^ℓ}) − ϕ_µ^ℓ(ω_uv^{(ℓ)}(W̃); W̃_{ϕ_µ^ℓ})∥
≤ M^{(ℓ)} ∥ω_uv^{(ℓ)}(W)∥ dist(W_{ϕ_µ^ℓ}, W̃_{ϕ_µ^ℓ}) + M^{(ℓ)} ∥ω_uv^{(ℓ)}(W) − ω_uv^{(ℓ)}(W̃)∥.

From Lemma 4.1:

∥ω_uv^{(ℓ)}(W)∥ ≤ 2(∥h_u^{(ℓ)}(W)∥ + ∥h_v^{(ℓ)}(W)∥ + ∥z_u^{(ℓ)}(W) − z_v^{(ℓ)}(W)∥ + ∥a_uv∥) ≤ 8Cβ^{(ℓ)} = 160D (M^{(ℓ)})² Cβ^{(ℓ−1)}.   (25)
∥ω_uv^{(ℓ)}(W) − ω_uv^{(ℓ)}(W̃)∥
≤ 2(∥h_u^{(ℓ)}(W) − h_u^{(ℓ)}(W̃)∥ + ∥h_v^{(ℓ)}(W) − h_v^{(ℓ)}(W̃)∥) + 2 | ∥z_u^{(ℓ)}(W) − z_v^{(ℓ)}(W)∥ − ∥z_u^{(ℓ)}(W̃) − z_v^{(ℓ)}(W̃)∥ |
≤ 2(∥h_u^{(ℓ)}(W) − h_u^{(ℓ)}(W̃)∥ + ∥h_v^{(ℓ)}(W) − h_v^{(ℓ)}(W̃)∥ + ∥z_u^{(ℓ)}(W) − z_u^{(ℓ)}(W̃)∥ + ∥z_v^{(ℓ)}(W) − z_v^{(ℓ)}(W̃)∥)
≤ 4√2 M^{(ℓ)} (2Cβ^{(ℓ−1)} + K_h^{(ℓ−1)} + K_µ^{(ℓ−1)}) dist_ℓ(W, W̃)
+ 4 M^{(ℓ)} ((6K_z^{(ℓ−1)}/ε) Cβ^{(ℓ−1)} + K_µ^{(ℓ−1)} + K_z^{(ℓ−1)} + Cβ^{(ℓ−1)}) dist_ℓ(W, W̃)
≤ 4 M^{(ℓ)} (4Cβ^{(ℓ−1)} + 2K_h^{(ℓ−1)} + 3K_µ^{(ℓ−1)} + K_z^{(ℓ−1)} + (6K_z^{(ℓ−1)}/ε) Cβ^{(ℓ−1)}) dist_ℓ(W, W̃)

and therefore:

∥µ_{u→v}^{(ℓ)}(W) − µ_{u→v}^{(ℓ)}(W̃)∥ ≤ 160D (M^{(ℓ)})³ Cβ^{(ℓ−1)} dist_ℓ(W, W̃)
+ 4 (M^{(ℓ)})² (4Cβ^{(ℓ−1)} + 2K_h^{(ℓ−1)} + 3K_µ^{(ℓ−1)} + K_z^{(ℓ−1)} + (6K_z^{(ℓ−1)}/ε) Cβ^{(ℓ−1)}) dist_ℓ(W, W̃).   (26)
K_h^{(ℓ)} = √2 M^{(ℓ)} (2Cβ^{(ℓ−1)} + K_h^{(ℓ−1)} + K_µ^{(ℓ−1)})
K_z^{(ℓ)} = M^{(ℓ)} ((6K_z^{(ℓ−1)}/ε) Cβ^{(ℓ−1)} + K_µ^{(ℓ−1)} + K_z^{(ℓ−1)} + Cβ^{(ℓ−1)})   (27)
(1/D) K_µ^{(ℓ)} = 160D (M^{(ℓ)})³ Cβ^{(ℓ−1)} + 4 (M^{(ℓ)})² (4Cβ^{(ℓ−1)} + 2K_h^{(ℓ−1)} + 3K_µ^{(ℓ−1)} + K_z^{(ℓ−1)} + (6K_z^{(ℓ−1)}/ε) Cβ^{(ℓ−1)})
We now determine the growth rates of the constants. Specifically, we will show by induction that

K_h^{(ℓ)} ≤ C Q^ℓ B^{(ℓ)},  K_µ^{(ℓ)} ≤ C Q^ℓ B^{(ℓ)},  K_z^{(ℓ)} ≤ C Q^ℓ B^{(ℓ)},   (28)

where Q = 224 D² max{C/ε, 1} and B^{(ℓ)} = ∏_{i=0}^{ℓ} (M^{(i)})³ ∏_{i=0}^{ℓ−1} β^{(i)}, with the convention that the empty product is equal to 1. To show the base case, we recall that K_h^{(0)} = K_z^{(0)} = 0 and K_µ^{(0)} = C M^{(0)}. Now, assuming that (28) holds for all ℓ′ < ℓ,
we will show that it also holds for ℓ. Using the facts that M^{(ℓ)} ≥ 1, β^{(ℓ)} ≥ 1 and B^{(ℓ)} ≥ 1 for all ℓ, we obtain:

K_h^{(ℓ)} ≤ √2 M^{(ℓ)} (2Cβ^{(ℓ−1)} + 2CQ^{ℓ−1}B^{(ℓ−1)}) ≤ 4√2 CQ^{ℓ−1} M^{(ℓ)} β^{(ℓ−1)} B^{(ℓ−1)} ≤ 4√2 max{C/ε, 1} CQ^{ℓ−1} B^{(ℓ)} ≤ CQ^ℓ B^{(ℓ)}

K_z^{(ℓ)} ≤ M^{(ℓ)} (Cβ^{(ℓ−1)} + 2CQ^{ℓ−1}B^{(ℓ−1)} + (6/ε) CQ^{ℓ−1}B^{(ℓ−1)} Cβ^{(ℓ−1)})
≤ (1 + 2 + 6C/ε) CQ^{ℓ−1} M^{(ℓ)} β^{(ℓ−1)} B^{(ℓ−1)} ≤ 9 max{C/ε, 1} CQ^{ℓ−1} B^{(ℓ)} ≤ CQ^ℓ B^{(ℓ)}   (29)

K_µ^{(ℓ)} ≤ 160D² (M^{(ℓ)})³ Cβ^{(ℓ−1)} + 4D (M^{(ℓ)})² (4Cβ^{(ℓ−1)} + 6CQ^{ℓ−1}B^{(ℓ−1)} + (6/ε) Cβ^{(ℓ−1)} CQ^{ℓ−1}B^{(ℓ−1)})
≤ (160D² + 16D + 24D + 24DC/ε) CQ^{ℓ−1} (M^{(ℓ)})³ β^{(ℓ−1)} B^{(ℓ−1)}
≤ (160D² + 64D) max{C/ε, 1} CQ^{ℓ−1} B^{(ℓ)} ≤ CQ^ℓ B^{(ℓ)}
where

dist(W^{(g)}, W̃^{(g)}) = dist_{L_egnn}(W, W̃) + Σ_{i=1}^{L_out} ∥W_i^{ϕ_out} − W̃_i^{ϕ_out}∥₂,

C, Q and B^{(·)} are as defined in Lemma 4.3, and

K_g = 2 (∏_{i=1}^{L_out} K_ψ β_{i,ϕ_out}) C Q^{L_egnn} B^{(L_egnn)}.

Proof. In the proof we assume ∥·∥ ≡ ∥·∥₂. From Equation 5, h_G = (1/|V|) Σ_{v∈V} h_v^{(L_egnn)}, and therefore h_G retains the Lipschitz continuity of each individual h_v^{(L_egnn)} with the same constant. It holds that:

∥g(G; W^{(g)}) − g(G; W̃^{(g)})∥ = ∥ϕ_out(h_G(W); W_{ϕ_out}) − ϕ_out(h_G(W̃); W̃_{ϕ_out})∥
≤ ∥ϕ_out(h_G(W); W_{ϕ_out}) − ϕ_out(h_G(W); W̃_{ϕ_out})∥ + ∥ϕ_out(h_G(W); W̃_{ϕ_out}) − ϕ_out(h_G(W̃); W̃_{ϕ_out})∥
≤ (∏_{i=1}^{L_out} K_ψ β_{i,ϕ_out}) ∥h_G(W)∥ dist(W_{ϕ_out}, W̃_{ϕ_out}) + (∏_{i=1}^{L_out} K_ψ β_{i,ϕ_out}) ∥h_G(W) − h_G(W̃)∥
≤ (∏_{i=1}^{L_out} K_ψ β_{i,ϕ_out}) C (β^{(L_egnn)} + Q^{L_egnn} B^{(L_egnn)}) dist(W^{(g)}, W̃^{(g)})
≤ 2 (∏_{i=1}^{L_out} K_ψ β_{i,ϕ_out}) C Q^{L_egnn} B^{(L_egnn)} dist(W^{(g)}, W̃^{(g)}).
Then

N(F, ε, ∥·∥_∞) ≤ ∏_{i=1}^{n} N(W_i, ε/(nK), ∥·∥_F).

Proof. Let C_i be an ε/(nK)-covering of W_i such that |C_i| = N(W_i, ε/(nK), ∥·∥_F), and define

C = {f_{W_1,…,W_n} : W_i ∈ C_i}.

Let f_{W_1,…,W_n} ∈ F. For all i we can choose Ŵ_i ∈ C_i such that ∥W_i − Ŵ_i∥_F ≤ ε/(nK). Therefore

∥f_{W_1,…,W_n} − f_{Ŵ_1,…,Ŵ_n}∥_∞ ≤ K(∥W_1 − Ŵ_1∥_F + ⋯ + ∥W_n − Ŵ_n∥_F) ≤ ε.
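This product-of-coverings argument (Lemma G.1 in the paper's numbering) can be illustrated on a toy function class of our own construction: f_{w₁,w₂}(x) = (w₁ + w₂)x on x ∈ [−1, 1] is 1-Lipschitz in each scalar "weight" (K = 1, n = 2), and an ε/(nK)-grid per weight yields an ε-covering in ∥·∥_∞:

```python
import numpy as np

eps, n, K = 0.1, 2, 1.0
# A grid of spacing eps / (n K) is an (eps / (n K))-covering of [0, 1].
grid = np.arange(0.0, 1.0 + 1e-12, eps / (n * K))

rng = np.random.default_rng(3)
for _ in range(1000):
    w = rng.uniform(0, 1, size=n)
    # Snap each weight to its nearest grid point.
    w_hat = np.array([grid[np.argmin(np.abs(grid - wi))] for wi in w])
    # sup_{x in [-1, 1]} |(w1 + w2) x - (w1_hat + w2_hat) x| = |sum of deviations|
    sup_err = abs((w - w_hat).sum())
    assert sup_err <= eps
```

Each weight deviates by at most ε/(nK), and the K-Lipschitz dependence on the n weights turns those deviations into at most ε in the sup norm, exactly as in the proof.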
where L = 3L_ϕ L_egnn + L_out is the total number of weight matrices, d is the maximum width across all layers, and

∆ = Σ_{ℓ=0}^{L_egnn} Σ_{ϕ∈{ϕ_h^{ℓ−1}, ϕ_z^{ℓ−1}, ϕ_µ^ℓ}} Σ_{i=1}^{L_ϕ} γ_{i,ϕ} + Σ_{i=1}^{L_out} γ_{i,ϕ_out}.

Proof. From Theorem 3.1, we can see that in order to bound R_{S,L} it suffices to find a bound for R̂_S(F_L), where
and

F = {g(·; W^{(g)}) : ∀ℓ, ∀ϕ ∈ {ϕ_h^ℓ, ϕ_z^ℓ, ϕ_µ^ℓ, ϕ_out}, ∀i: ∥W_i^ϕ∥ ≤ β_{i,ϕ}}

is the family of scoring models with bounded spectral norms of the weight matrices. The primary tool that we will use for bounding R̂_S(F_L) is Lemma 3.1. However, to use it directly, the assumption f₀ ∈ F_L would need to be satisfied, which is very restrictive as it assumes the existence of a perfect classifier in the model class. Instead, we observe that

where

F̂_L = {(x, y) ↦ 1 − L(f(x), y) : f ∈ F}.

It may seem tautological, but now there exists f̂₀ ∈ F̂_L, because we can take f₀ ≡ −1, and since y ∈ [0, 1], it holds that for all (x, y), f̂₀(x, y) := 1 − L(f₀(x), y) = 0. We can thus use Lemma 3.1 for F̂_L and obtain
R̂_S(F̂_L) ≤ inf_{α>0} ( 4α/√m + (12/m) ∫_α^{2√m} √(log N(F̂_L, r, ∥·∥_∞)) dr ).

Now, since (x, y) ↦ 1 − L(x, y) is 2-Lipschitz in its first argument, an r/2-covering of F yields an r-covering of F̂_L, and therefore

log N(F̂_L, r, ∥·∥_∞) ≤ log N(F, r/2, ∥·∥_∞).

Furthermore, from Equation (30), one can easily check that R̂_S(F_L) = R̂_S(F̂_L), and in summary it holds that
R̂_S(F_L) ≤ inf_{α>0} ( 4α/√m + (12/m) ∫_α^{2√m} √(log N(F, r/2, ∥·∥_∞)) dr ).

By choosing α = 1/√m, we obtain the simplified bound

R̂_S(F_L) ≤ 4/m + (24/√m) √( log N(F, 1/(2√m), ∥·∥_∞) ).   (31)
We now move on to bounding log N(F, 1/(2√m), ∥·∥_∞). From Lemma 4.4:

∥g(·; W^{(g)}) − g(·; W̃^{(g)})∥_∞ = sup_G ∥g(G; W^{(g)}) − g(G; W̃^{(g)})∥₂ ≤ K_g dist(W^{(g)}, W̃^{(g)}) ≤ K_g Σ_{ϕ, i} ∥W_i^ϕ − W̃_i^ϕ∥_F,
where K_g = 2 (∏_{i=1}^{L_out} K_ψ β_{i,ϕ_out}) C Q^{L_egnn} B^{(L_egnn)} and the last inequality comes from the relationship between the spectral and Frobenius norms.

Thus, for L = 3L_egnn L_ϕ + L_out (the total number of weight matrices), we can use Lemma G.1: it suffices to find an r/(L K_g)-covering for each weight matrix W_i^ϕ, and their Cartesian product yields an r-covering of F:

log N(F, r, ∥·∥_∞) ≤ Σ_{ℓ=0}^{L_egnn} Σ_{ϕ∈{ϕ_h^{ℓ−1}, ϕ_z^{ℓ−1}, ϕ_µ^ℓ}} Σ_{i=1}^{L_ϕ} log N(W_i^ϕ, r/(L K_g), ∥·∥_F) + Σ_{i=1}^{L_out} log N(W_i^{ϕ_out}, r/(L K_g), ∥·∥_F).
Now, since the spectral norm of the weight matrices is bounded, we can use Lemma 3.2 and obtain

log N(W_i^ϕ, r/(L K_g), ∥·∥_F) ≤ d² log(1 + 2 d L K_g β_{i,ϕ} / r),   (32)

where d = max_{ℓ=0,…,L_egnn} max_{ϕ∈{ϕ_h^{ℓ−1}, ϕ_z^{ℓ−1}, ϕ_µ^ℓ, ϕ_out}} max_{i=1,…,L_ϕ} dim(W_i^ϕ) is the width of the scoring model and dim(W) denotes the larger of the two dimensions of W,
and plugging into (31) gives

R̂_S(F_L) ≤ 4/m + (24d/√m) √( Σ_{ℓ=0}^{L_egnn} Σ_{ϕ∈{ϕ_h^{ℓ−1}, ϕ_z^{ℓ−1}, ϕ_µ^ℓ}} Σ_{i=1}^{L_ϕ} log(5 d√m L K_g β_{i,ϕ}) + Σ_{i=1}^{L_out} log(5 d√m L K_g β_{i,ϕ_out}) ),   (33)

where we denote the term under the square root by Σ.
and

log(K_g) = log 2 + Σ_{i=1}^{L_out} log(K_ψ β_{i,ϕ_out}) + log C + L_egnn log Q + log B^{(L_egnn)}
= log 16 + log(Dβ) + Σ_{i=1}^{L_out} log(K_ψ β_{i,ϕ_out}) + L_egnn log(224 D² max{8Dβ/ε, 1}) + 3 Σ_{ℓ=0}^{L_egnn} log(M^{(ℓ)}) + Σ_{ℓ=0}^{L_egnn−1} log( (20D)^ℓ (∏_{i=0}^{ℓ} M^{(i)})² )
≤ C₁ (L_egnn(L_egnn − 1)/2) log(Dβ/ε̂) + Σ_{i=1}^{L_out} log(K_ψ β_{i,ϕ_out}) + 3 Σ_{ℓ=0}^{L_egnn} Σ_{i=0}^{ℓ} log(M^{(i)}),
where C₁ does not depend on the parameters of the model and ε̂ = min{1, ε}.

Since M^{(i)} = max_{ϕ∈{ϕ_h^{i−1}, ϕ_z^{i−1}, ϕ_µ^i}} ∏_{j=1}^{L_ϕ} K_ψ β_{j,ϕ}, we have

log(M^{(i)}) ≤ Σ_{ϕ∈{ϕ_h^{i−1}, ϕ_z^{i−1}, ϕ_µ^i}} Σ_{j=1}^{L_ϕ} log(K_ψ β_{j,ϕ})

and

log(K_g) ≤ C₂ (L_egnn(L_egnn − 1)/2) log(Dβ/ε̂) + Σ_{i=1}^{L_out} log(K_ψ β_{i,ϕ_out}) + 3 Σ_{ℓ=0}^{L_egnn} Σ_{ϕ∈{ϕ_h^{ℓ−1}, ϕ_z^{ℓ−1}, ϕ_µ^ℓ}} Σ_{i=1}^{L_ϕ} (L_egnn − ℓ + 1) log(K_ψ β_{i,ϕ}).
In summary:

Σ ≤ C₃ L ( Σ_{ℓ=0}^{L_egnn} Σ_{ϕ∈{ϕ_h^{ℓ−1}, ϕ_z^{ℓ−1}, ϕ_µ^ℓ}} Σ_{i=1}^{L_ϕ} w_{i,ϕ} log(Γ β_{i,ϕ}) + Σ_{i=1}^{L_out} w_{i,ϕ_out} log(Γ β_{i,ϕ_out}) ),

where

w_{i,ϕ} = 1 for ϕ = ϕ_out, and w_{i,ϕ} = L_egnn − ℓ + 1 for ϕ ∈ {ϕ_h^ℓ, ϕ_z^ℓ, ϕ_µ^ℓ};
Γ = dmLDβK_ψ / ε̂.
We can now use Equation 33 and Theorem 3.1 to obtain the bound for the generalization error of the scoring model g, with probability at least 1 − δ:

R_{S,L}(g) ≤ 8/m + (48d/√m) √Σ + 3 √( log(2/δ) / (2m) ).
Furthermore, for a given g(·; W), we observe that the stretching factor defined in Equation 15 can serve as the bound, i.e. we can set β_{i,ϕ} = κ(W_i^ϕ) and get

R_{S,L}(g) ≤ 8/m + C₃ (48d√L/√m) √( Σ_{ℓ=0}^{L_egnn} Σ_{ϕ∈{ϕ_h^{ℓ−1}, ϕ_z^{ℓ−1}, ϕ_µ^ℓ}} Σ_{i=1}^{L_ϕ} γ_{i,ϕ} + Σ_{i=1}^{L_out} γ_{i,ϕ_out} ) + 3 √( log(2/δ) / (2m) ),

where
log N( {A ∈ R^{d₁×d₂} : ∥A∥₂ ≤ λ}, ε, ∥·∥_F ) ≤ d₁d₂ log(1 + 2 min{d₁, d₂} λ / ε).

This is in contrast with a perhaps more commonly used result (Lemma 10 in Chen et al. (2020), a special case of Lemma 3.2 in Bartlett et al. (2017)), which states:

log N( {A ∈ R^{d₁×d₂} : ∥A∥_{2,1} ≤ λ}, ε, ∥·∥₂ ) ≤ (λ²/ε²) log(2 d₁ d₂).
The former yields a logarithmic dependence on the norm, while the latter produces a multiplicative one. However, the two are not directly comparable. First, there is a tradeoff between the dependence on the norm and on the width of the network d = max{d₁, d₂}: the former bound is logarithmic in the norm but quadratic in the width, whereas the latter is the other way around. In particular, for large values of λ and small d, the former may be preferable, whereas the latter is better for small λ and large d.
Furthermore, the two lemmas bound the covering numbers of different sets of matrices: the former assumes a bounded spectral norm and gives a bound w.r.t. the Frobenius norm, while the latter assumes a bounded ∥·∥_{2,1} norm and gives a bound w.r.t. the spectral norm, which makes a direct comparison difficult.
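These regimes can be made concrete with a quick numerical comparison (our illustration; as noted above, the two expressions bound covering numbers of different matrix classes, so this only illustrates the relative magnitudes):

```python
import math

def spectral_bound(d1, d2, lam, eps):
    # d1*d2 * log(1 + 2*min(d1,d2)*lam/eps): logarithmic in the norm bound lam,
    # quadratic in the width.
    return d1 * d2 * math.log(1 + 2 * min(d1, d2) * lam / eps)

def l21_bound(d1, d2, lam, eps):
    # (lam/eps)^2 * log(2*d1*d2): multiplicative in the norm, logarithmic in width.
    return (lam / eps) ** 2 * math.log(2 * d1 * d2)

# Large norm, small width: the spectral-norm bound is smaller.
assert spectral_bound(4, 4, 1e3, 0.1) < l21_bound(4, 4, 1e3, 0.1)
# Small norm, large width: the (2,1)-norm bound is smaller.
assert l21_bound(512, 512, 2.0, 0.1) < spectral_bound(512, 512, 2.0, 0.1)
```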
I. Impact of ε-normalization
In section 5 we argue that ε-normalization plays an important role in the derivation of the generalization bound. Here,
we provide more details to support this claim. Specifically, suppose we follow the same reasoning as in Lemmas 4.1–4.4 for the unnormalized EGNN model, i.e. γ(z_v^{(ℓ)}, z_u^{(ℓ)}, ε) ≡ 1 in Equation (2). Assume further that for all i and ϕ it holds that ∥W_i^ϕ∥₂ ≤ β_ϕ.
Consider the bound β^{(ℓ)} on the ℓ-layer EGNN embeddings, i.e. ∥h_v^{(ℓ)}∥₂ ≤ β^{(ℓ)}. From Lemma 4.1, we know that β^{(ℓ)} = O(C₁^ℓ) for some C₁ > 1 in the ε-normalized EGNN model. However, from Equation (19), we can see that in the unnormalized model β^{(ℓ)} would satisfy a recursive relationship of the form β^{(ℓ)} ∝ (β^{(ℓ−1)})², which implies

β^{(ℓ)} = O(C₂^{2^ℓ})

for some C₂ > 1. The super-exponential growth of β^{(ℓ)} leads to a super-exponential Lipschitz constant of the unnormalized EGNN, and therefore an exponential dependence of the generalization gap on the number of EGNN layers.
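The two growth regimes can be contrasted with a short numerical sketch (our own, with illustrative constants): with ε-normalization the recursion behaves like β^{(ℓ)} = c·β^{(ℓ−1)}, so log β^{(ℓ)} grows linearly in ℓ, while without it β^{(ℓ)} ≈ (β^{(ℓ−1)})², so log β^{(ℓ)} doubles each layer:

```python
import math

c, beta0, depth = 3.0, 2.0, 10
log_norm = log_unnorm = math.log(beta0)
for _ in range(depth):
    log_norm += math.log(c)   # normalized: beta_l = c * beta_{l-1}
    log_unnorm *= 2           # unnormalized: beta_l = beta_{l-1} ** 2

# log(beta) grows linearly in depth with normalization,
# but geometrically (doubling per layer) without it.
assert abs(log_norm - (math.log(beta0) + depth * math.log(c))) < 1e-9
assert abs(log_unnorm - (2 ** depth) * math.log(beta0)) < 1e-9
assert log_unnorm > log_norm
```

After only 10 layers, the unnormalized log-norm is roughly 2¹⁰ ≈ 1000 times its initial value, which is exactly the super-exponential behavior that ε-normalization prevents.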
J. Implementation details
The QM9 dataset (Ramakrishnan et al. (2014)) comprises small molecules with a maximum of 29 atoms in 3D space. Each
atom is defined by a 3D position and a five-dimensional one-hot node embedding that represents the atom type, indicated as
(H, C, N, O, F). The dataset has several different chemical properties (outputs), from which we arbitrarily chose 4 to assess
the generalization of EGNNs. We report the results for the remaining properties in Appendix K.
We implement all models using PyTorch (Paszke et al., 2017) and train them for 1000 epochs using the Adam optimizer and the MSE loss function. We use a batch size of 96 and cosine decay for the learning rate, starting at 10⁻³, except for the Mu (µ) property, where the initial value was set to 5 × 10⁻⁴. We run five independent trials with different seeds.
For the experiments in Figure 3, we use width d = 64 (for all layers) and Legnn = 5 for the experiments regarding the
spectral norm, Legnn = 3 for the one regarding the width, and width d = 16 for assessing the generalization gap in terms
of the number of layers. We apply ε-normalization with ε = 1. All internal MLPs (ϕ_h, ϕ_z, ϕ_µ, ϕ_out) have two layers with the SiLU activation function. Following the original repository, we use the identity activation function at the last layer of ϕ_z. For the experiments on the impact of ε-normalization, we use ε = 1 and width d = 128.
We note that in their original work, Satorras et al. (2021) do not update the positional features z over the layers in the experiments on QM9. In our experiments, we consider full EGNN models.
For the experiments in Table 3, we use width d = 16, ε-normalization (ε = 1), and internal MLPs with two layers (i.e.,
Lϕ = Lout = 2) with SiLU activation functions. We consider 2K samples for training, and full validation and test sets. The
statistics are obtained from three independent runs, with different seeds. We perform model selection for the regularization
factor λ ∈ {1, 1e-3, 1e-5, 1e-7} using the validation set.
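The depth-dependent weights w_{i,ϕ} in the bound suggest penalizing each layer's log spectral norm proportionally to L_egnn − ℓ + 1, so that bottom layers are regularized more strongly. The numpy sketch below is our own illustration of that idea (the function name and exact form are assumptions; the paper's regularizer may differ in details):

```python
import numpy as np

def depth_weighted_spectral_penalty(layer_weights, L_egnn):
    """layer_weights: list of (ell, W) pairs, where ell is the EGNN layer
    that weight matrix W belongs to. Bottom layers (small ell) receive the
    larger weight L_egnn - ell + 1, mirroring w_{i,phi} in the bound."""
    penalty = 0.0
    for ell, W in layer_weights:
        # np.linalg.norm(W, 2) is the spectral norm (largest singular value).
        penalty += (L_egnn - ell + 1) * np.log(np.linalg.norm(W, 2))
    return penalty

rng = np.random.default_rng(4)
mats = [(ell, rng.normal(size=(16, 16))) for ell in range(3)]
p = depth_weighted_spectral_penalty(mats, L_egnn=3)
# Scaling all matrices up strictly increases the penalty (log is monotone).
p2 = depth_weighted_spectral_penalty([(l, 2 * W) for l, W in mats], L_egnn=3)
assert p2 > p
```

In training, λ times such a penalty would be added to the MSE loss, with λ selected on the validation set as described above.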
[Figure: four panels (HOMO, ∆, µ, LUMO); y-axis: generalization gap; curves for different regularization factors λ.]
Figure 5. Generalization gap over training epochs for the regularization method proposed in this work. Circles at the end of training denote the final test MSE.
[Figure: four panels (HOMO, ∆, µ, LUMO); y-axis: generalization gap; curves for different regularization factors λ.]
Figure 6. Generalization gap over training epochs for SPECAVG. Circles at the end of training denote the final test MSE.
Time comparison. We note that regularization does not affect inference time, as it only applies during training. The impact of the proposed method on training time is reported in Table 5. For the largest model (7 layers), the regularized version incurs a computational overhead of approximately 50%, i.e., the ratio of regularized to non-regularized training time is around 1.5.
Additional QM9 properties. Table 6 reports results (from three independent runs) for the remaining eight QM9 tasks. For these experiments, we used width d = 32, cosine decay with an initial learning rate of 10⁻³, a batch size of 96, and 2000 epochs. Overall, we observe that SPECAVG and our regularizer achieve similar test MSE values, while our method obtains smaller generalization gaps.
Table 6. Test mean squared error (MSE) and generalization gap on additional QM9 tasks for different regularization methods. We denote
the best-performing methods in bold. Results are multiplied by 100 for better visualization.
                        Test MSE                              Generalization gap
Task   Legnn   None      SPECAVG   Ours          None      SPECAVG   Ours
3 42.87 ± 0.99 42.72 ± 0.88 41.76 ± 1.11 29.29 ± 17.10 29.05 ± 16.87 16.46 ± 5.65
α 5 41.47 ± 0.92 41.31 ± 0.40 41.24 ± 1.30 34.79 ± 10.96 34.34 ± 11.37 34.41 ± 11.19
7 44.11 ± 1.97 43.10 ± 0.93 41.91 ± 2.27 38.98 ± 5.76 28.67 ± 14.88 18.78 ± 22.20
3 7.59 ± 1.38 7.10 ± 1.71 7.09 ± 1.63 7.04 ± 1.64 6.32 ± 2.36 6.49 ± 1.91
Cv 5 12.11 ± 1.00 11.32 ± 1.03 11.28 ± 1.21 12.09 ± 1.01 11.30 ± 1.02 10.03 ± 1.57
7 10.10 ± 2.28 10.24 ± 2.45 9.08 ± 0.51 10.08 ± 2.28 10.23 ± 2.45 9.04 ± 0.48
3 16.60 ± 7.35 15.71 ± 6.62 16.00 ± 6.08 15.17 ± 6.91 13.91 ± 5.81 10.74 ± 3.43
G 5 21.88 ± 9.19 20.63 ± 7.84 21.55 ± 5.24 21.57 ± 9.00 19.93 ± 7.18 17.26 ± 4.95
7 28.92 ± 1.89 26.89 ± 3.39 26.11 ± 4.67 28.83 ± 1.94 26.29 ± 4.01 22.84 ± 9.23
3 15.49 ± 8.10 14.91 ± 6.92 15.19 ± 4.33 13.23 ± 7.48 12.36 ± 5.74 12.97 ± 3.85
H 5 22.60 ± 8.35 22.39 ± 7.58 23.00 ± 4.53 22.20 ± 8.11 21.95 ± 7.28 11.67 ± 5.77
7 25.75 ± 0.59 26.81 ± 2.47 26.91 ± 5.16 25.64 ± 0.60 26.67 ± 2.61 14.67 ± 9.72
3 56.63 ± 4.99 56.62 ± 4.99 56.65 ± 4.94 3.80 ± 2.29 3.81 ± 2.31 3.79 ± 2.31
r2 5 55.31 ± 1.07 55.22 ± 1.24 55.53 ± 1.17 5.03 ± 1.45 5.02 ± 1.45 4.95 ± 1.63
7 58.28 ± 3.76 58.25 ± 3.78 57.61 ± 2.50 6.86 ± 3.36 7.20 ± 3.02 5.22 ± 0.54
3 15.80 ± 7.34 15.46 ± 6.59 16.44 ± 3.67 13.59 ± 6.58 12.82 ± 5.15 14.03 ± 2.68
U 5 22.22 ± 6.94 19.66 ± 7.46 23.34 ± 4.77 21.84 ± 6.68 19.39 ± 7.16 17.75 ± 7.28
7 27.43 ± 1.49 25.63 ± 0.90 26.18 ± 2.99 27.31 ± 1.52 25.09 ± 1.77 22.98 ± 8.18
3 15.70 ± 7.18 15.10 ± 5.99 15.87 ± 6.51 13.62 ± 6.32 12.98 ± 5.08 13.80 ± 5.80
U0 5 22.03 ± 8.80 21.66 ± 8.44 22.79 ± 4.57 21.74 ± 8.63 21.30 ± 8.22 12.67 ± 5.19
7 26.81 ± 3.12 21.71 ± 9.39 21.33 ± 9.74 26.71 ± 3.20 21.64 ± 9.46 21.13 ± 9.86
3 94.14 ± 0.26 94.25 ± 0.29 94.21 ± 0.45 2.17 ± 0.81 2.13 ± 0.92 1.81 ± 1.06
zpve 5 95.29 ± 0.41 94.82 ± 0.25 95.23 ± 0.29 7.18 ± 2.95 1.67 ± 0.50 3.82 ± 2.67
7 95.74 ± 0.32 95.03 ± 0.29 95.14 ± 0.22 19.62 ± 13.81 4.95 ± 3.97 2.03 ± 0.97