
Variational Autoencoder and Latent Space Diffusion for

Accelerated Material Discovery

Ryan Rezaei

Canyon Crest Academy

Abstract

Advancements in materials science have driven technological progress, yet discovering new materials remains challenging due to the vast, unexplored chemical space. Traditional methods, relying on trial-and-error and computational screening, explore only a fraction of viable candidates and are resource-intensive. Generative machine learning models offer a promising alternative by autonomously navigating chemical spaces to identify novel, stable materials. However, existing models often struggle to produce thermodynamically stable structures and have limited applicability.

In this work, we present a framework that combines a Variational Autoencoder (VAE) with a latent diffusion model to overcome these limitations. Our VAE, utilizing adapted geometric deep learning architectures DimeNet++ and GemNet, encodes critical geometric and chemical properties—including lattice parameters, atomic species, and coordinates—into a compact latent space that respects crystallographic symmetries and material invariances such as permutation, translation, rotation, and periodicity. We implement a Fourier time-conditioned latent diffusion process enhanced with rectified flow sampling, based on the denoising formulation in Stable Diffusion 3, with a modified GemNet architecture for time conditioning. This enables controlled exploration of stable material configurations while maintaining the geometric equivariances inherent in crystalline structures.

Leveraging a curated dataset of 1.5 million materials from the Materials Project and Alexandria databases, our model generates thousands of novel, DFT-verified stable crystalline materials. By enabling property-targeted generation through integrated MLPs trained on this new dataset, we may further direct the model's discovery of functional materials for sustainable energy, electronics, and environmental technologies, revolutionizing the field of materials science and making material discovery thousands of times more efficient.

1 Introduction

Materials science is a cornerstone of modern technology, underpinning advancements across a


wide range of fields including electronics, energy, transportation, and healthcare. The development
of new materials has historically driven significant technological leaps, from the silicon revolution
in semiconductors to the advent of high-performance batteries for electric vehicles. Discovering
novel materials is therefore critical for addressing some of the most pressing challenges of our
time, such as sustainable energy production, efficient energy storage, and climate change.
Traditionally, the search for new materials has relied heavily on experimental trial-and-
error methods guided by human intuition and expertise. For instance, current approaches include
the substitution method, which involves systematically replacing elements or functional groups
within a known structure to explore potential variations that might lead to desirable properties.
This approach, while valuable, is severely limited by both time and resources: a single material
can take 6-24 months to validate and cost between $50,000 and $500,000. Moreover, synthesizing and
characterizing a single new compound often requires multiple validation iterations due to unex-
pected lab behaviors, and the number of possible combinations of elements and crystal structures
is virtually limitless (Pickard and Needs, 2011). As a result, many potentially transformative ma-
terials remain undiscovered.
In recent years, computational methods have significantly accelerated materials discov-
ery by enabling high-throughput screening of material properties using first-principles calcula-
tions such as density functional theory (DFT) (Jain et al., 2017). The establishment of large open

databases of computed material properties, such as the Materials Project (Jain et al., 2013), OQMD
(Saal et al., 2013), AFLOW (Curtarolo et al., 2012), and NOMAD (Draxl and Scheffler, 2019), has
provided researchers with unprecedented access to data on thousands of compounds. These re-
sources have facilitated the identification of promising candidates for applications ranging from
thermoelectrics to catalysis (Sendek et al., 2017; Liu et al., 2017).
Despite significant advancements, high-throughput screening remains fundamentally con-
strained by multiple factors. First, it depends heavily on known or easily generated structures,
limiting exploration of truly novel materials. Second, even the largest computational searches have
covered only a tiny portion of the possible material space; for example, in terms of stoichiome-
try alone, there are estimated to be around 10^10 potential quaternary inorganic compounds (Bartel
et al., 2020), far exceeding the typical 10^6 to 10^7 compounds explored in screenings (Schütt et al.,
2018; Liu et al., 2017). In addition, these approaches face computational bottlenecks (many den-
sity functional theory calculations require significant computing resources and may still produce
numerical inaccuracies for complex systems). Furthermore, materials predicted to be promising
may prove difficult to synthesize or unstable in practice.
To generate novel structures, there has been a growing focus on using machine learning
models or advanced molecular algorithms that satisfy stable-material requirements, which exist in a
low-dimensional subspace of the 10^10 possible stoichiometric arrangements of materials (Xie and
Jaakkola, 2021). Unlike traditional screening and substitution, algorithmic approaches assist in
creating new material candidates en masse with the desired stability, based on DFT calculations, from
the outset. Algorithmic approaches leverage computational techniques such as generative models,
evolutionary algorithms, and reinforcement learning to more effectively explore the vast chemical
space.
Generative models have emerged as promising tools for materials discovery due to their
ability to learn complex distributions of material structures and generate novel compounds absent
from current databases. These deep learning architectures excel at capturing underlying patterns
in material data and proposing new candidates with tailored properties (Xie and Jaakkola, 2021).

However, existing generative approaches face significant limitations (Hoffmann et al., 2019). A
primary challenge is producing thermodynamically stable materials - a requirement that demands
both quantum mechanically optimal atomic positions and proper chemical bonding patterns with
neighboring atoms (Xie and Jaakkola, 2021). Furthermore, many models are restricted to limited
set of elements or types of structures (Chen et al., 2021), restricting their broader applicability to
material classes. The challenge is further complicated by the need to preserve multiple crystalline
properties simultaneously: periodic atomic arrangements, rotational and translational symmetries,
and permutation invariance among identical atoms. These constraints, combined with restricted
dataset availability and challenging data curation processes, make it difficult for traditional com-
putational methods to effectively represent and analyze crystalline structures. Although recent
SOTA approaches such as CDVAE (Xie and Jaakkola, 2021) have demonstrated the potential of
generative models for crystal discovery, they face several key limitations. CDVAE’s relatively
small dataset (approximately 20,000 materials) restricts its ability to learn complex crystallographic patterns,
and its single-stage training process fails to effectively leverage modern advances in latent space
modeling. Additionally, CDVAE employs traditional denoising techniques rather than incorporat-
ing state-of-the-art approaches like rectified flow sampling and logit-normal time conditioning that
have revolutionized other domains such as image, audio, and video generation (SORA and SD3).
This work addresses these limitations through a two-stage framework that makes several key
contributions. First, this approach develops a comprehensive encoding scheme using a specialized
Variational Autoencoder (Kingma and Welling, 2013) that captures material structures—including
lattice parameters, atomic species, and coordinates—in a learned latent space (unlike CDVAE),
leveraging advanced geometric deep learning architectures DimeNet++ (Gasteiger and Günnemann,
2020) and GemNet (Gasteiger and Günnemann, 2021) to respect the geometric constraints of ma-
terials. Second, the model implements a time-conditioned, efficient latent diffusion process with
rectified flow sampling (Liu et al., 2022) and a modified GemNet architecture designed for time
conditioning. This novel two-stage process enables controlled exploration of stable material con-
figurations while maintaining geometric equivariances and invariances of crystalline structures.

Third, the work curates and provides a large-scale dataset of approximately 1.5 million thermody-
namically stable materials (verified via DFT with formation energies < 0.1 eV above hull) derived
from the Materials Project (Jain et al., 2013) and Alexandria (Project, 2022) databases to train
the model and enable property-guided generation in terms of specialized MLPs in the fine-tuning
stage.
By incorporating the latest advances in diffusion modeling from Stable Diffusion 3 (Esser
et al., 2024) and separating training into VAE and diffusion stages, this framework enables un-
precedented control over material generation. The VAE focuses on learning meaningful crystal
representations while the diffusion model leverages modern denoising advances, making this the
first latent diffusion model for crystalline materials to significantly reduce the search space for vi-
able candidates while maintaining high structural validity and chemical feasibility. In advancing
the capabilities of generative models in materials science, this work aims to accelerate the discov-
ery of novel materials and contribute to addressing the technological challenges of the future.

2 Background and Preliminaries

2.1 Universal Representation of Materials

Any crystal structure can be represented by a repeating unit, known as the unit cell, that tiles the
entire three-dimensional space through periodic repetition. The unit cell contains a finite number
of atoms arranged in a specific configuration, which, when repeated, forms the bulk material.
To effectively model and generate new materials, it is essential to have a universal and compact
representation that captures all the necessary information about the unit cell.
We adopt a universal representation for a material M as follows (Zeni and Xie, 2023; Xie
and Jaakkola, 2021):

M = (A, X, L), (1)

where:

• A = (a1, a2, . . . , aN)⊤ ∈ A^N denotes the atomic species of the N atoms inside the unit cell,
with A being the set of possible atomic elements.

• X = (x1, x2, . . . , xN)⊤ ∈ [0, 1)^{N×3} represents the fractional coordinates of the atoms within
the unit cell.

• L = [l1, l2, l3] ∈ R^{3×3} is the lattice matrix, where l1, l2, l3 are the lattice vectors defining the
shape and size of the unit cell.

This representation encapsulates all the essential information about the material’s crystal
structure, including the types of atoms present, their positions relative to the unit cell, and the
geometric parameters defining the unit cell itself.

2.1.1 Lattice Parameters and the Lattice Matrix

The lattice matrix L defines the fundamental periodicity of a crystal structure through three lattice
vectors l1 , l2 , and l3 . These vectors are determined by six lattice parameters:

• Three lattice constants (a, b, c) representing the lengths of the unit cell edges.

• Three interaxial angles (α, β, γ) where:

– α is the angle between vectors l2 and l3 .

– β is the angle between vectors l1 and l3 .

– γ is the angle between vectors l1 and l2 .

The lattice vectors can be expressed in Cartesian coordinates as:

l1 = a (1, 0, 0)⊤,    l2 = b (cos γ, sin γ, 0)⊤,

l3 = c (cos β, (cos α − cos β cos γ)/sin γ, √(1 − cos²β − ((cos α − cos β cos γ)/sin γ)²))⊤

Here, the angles α, β, and γ are measured in radians, and cos α, cos β, and cos γ denote the
cosines of these angles.
The volume of the unit cell is given by the absolute value of the determinant of the lattice
matrix:

V = |det L| = abc √(1 − cos²α − cos²β − cos²γ + 2 cos α cos β cos γ). (2)

For physical validity, the lattice matrix must be non-singular, ensuring a positive, non-zero
unit cell volume.
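For concreteness, the following NumPy sketch (our own illustration, not the model code) builds the lattice matrix from the six lattice parameters using the expressions above and checks the resulting volume against Eq. (2).

```python
import numpy as np

def lattice_matrix(a, b, c, alpha, beta, gamma):
    """Build the 3x3 lattice matrix with rows l1, l2, l3 from lattice parameters.

    Lengths a, b, c (e.g., in Angstrom); angles alpha, beta, gamma in radians.
    """
    ca, cb, cg, sg = np.cos(alpha), np.cos(beta), np.cos(gamma), np.sin(gamma)
    l1 = np.array([a, 0.0, 0.0])
    l2 = np.array([b * cg, b * sg, 0.0])
    cx = cb                                  # x-component of the unit vector along l3
    cy = (ca - cb * cg) / sg                 # y-component
    cz = np.sqrt(1.0 - cx**2 - cy**2)        # z-component (requires a valid cell)
    l3 = c * np.array([cx, cy, cz])
    return np.stack([l1, l2, l3])

# Example: a monoclinic cell; the determinant matches Eq. (2).
a, b, c = 5.0, 6.0, 7.0
alpha, beta, gamma = np.radians([90.0, 100.0, 90.0])
L = lattice_matrix(a, b, c, alpha, beta, gamma)
vol_det = abs(np.linalg.det(L))
vol_eq2 = a * b * c * np.sqrt(
    1 - np.cos(alpha)**2 - np.cos(beta)**2 - np.cos(gamma)**2
    + 2 * np.cos(alpha) * np.cos(beta) * np.cos(gamma)
)
assert np.isclose(vol_det, vol_eq2)
```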
In our VAE model, we perform operations directly on the lattice parameters (a, b, c, α, β,
γ), reconstructing the lattice matrix L as needed using the relationships above; however, we do not
perform diffusion directly on the lattice and angle vectors. This approach allows us to efficiently
explore variations in the unit cell geometry while maintaining crystallographic consistency during
the generation of new material structures.

2.1.2 Fractional and Cartesian Coordinates

The atomic positions within the unit cell are expressed in fractional coordinates X, which are
scaled relative to the lattice vectors. The fractional coordinates are confined to the unit cube [0, 1)³,
accounting for the periodicity of the crystal structure. The periodicity in fractional coordinates is
defined by the unit hypertorus, implying the equivalence relation x ∼ x + k for k ∈ Z³.
The conversion between fractional coordinates X and Cartesian coordinates X̃ is given by:

X̃ = LX, (3)

X = L−1 X̃. (4)

This relationship allows us to compute the absolute positions of atoms in space, which
is essential for calculating interatomic distances and interactions (Zeni and Xie, 2023; Xie and
Jaakkola, 2021).
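A minimal NumPy sketch of this conversion (illustrative only; it stores one atom per row of X and the lattice vectors as the rows of L, so the same relationship is applied row-wise):

```python
import numpy as np

def frac_to_cart(frac_coords, lattice):
    """Fractional -> Cartesian: with lattice rows l1, l2, l3 and one atom per row,
    x_cart = x_frac @ L."""
    return np.asarray(frac_coords) @ np.asarray(lattice)

def cart_to_frac(cart_coords, lattice):
    """Cartesian -> fractional, wrapped back into the unit cube [0, 1)."""
    frac = np.asarray(cart_coords) @ np.linalg.inv(lattice)
    return frac % 1.0  # periodic equivalence x ~ x + k, k in Z^3

# Round trip on a toy orthorhombic cell.
L = np.diag([4.0, 5.0, 6.0])
X = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.25]])
X_cart = frac_to_cart(X, L)            # [[0, 0, 0], [2.0, 2.5, 1.5]]
assert np.allclose(cart_to_frac(X_cart, L), X)
```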

2.2 Graph Representations and Diffusion on Lattice Parameters

Materials can be represented as directed multigraphs G = (V, E) to encode their periodic structures
(Xie and Grossman, 2018). In this representation:

• V = {v1 , v2 , . . . , vN } is the set of nodes representing atoms.

• E = {e_ij^(k) | i, j ∈ {1, . . . , N}, k ∈ Z³} is the set of edges representing interactions between
atoms. Each edge e_ij^(k) connects atom i in the reference unit cell to atom j in a neighboring
unit cell shifted by the lattice vector k = (k1, k2, k3).

Edges can be defined based on various criteria, such as connecting atoms within a certain
distance or using k-nearest neighbors under periodic boundary conditions. This graph-based rep-
resentation allows the use of message-passing neural networks and equivariant neural networks for
material representation learning.
In our generative model, we aim to generate stable material structures by operating on the lattice
parameters—specifically, the lattice constants a, b, c and angles α, β, γ—which are predicted directly
from the latent space rather than diffused (see Section 3.2). This lets us explore variations in the unit
cell geometry that may lead to stable and novel material configurations.
We perform diffusion on the latent variables representing the material's chemical and structural
composition, namely the atomic species and coordinate latents. This approach allows

us to comprehensively explore the space of possible crystal structures while maintaining physical
and chemical validity through the learned latent space representations.

2.2.1 Niggli Reduction of Lattice Cells

To ensure that the generated lattice parameters correspond to physically meaningful and standard-
ized unit cells, we employ Niggli reduction (Hahn, 1983). Niggli reduction transforms the lattice
parameters into a reduced form, removing redundancies and ensuring that equivalent lattices are
represented uniquely. This standardization is crucial for the stability and uniqueness of the gener-
ated materials.
By incorporating Niggli-reduced lattice cells in our model, we address the periodic invari-
ance of materials and avoid generating duplicate structures that are physically identical but differ
in their lattice representations.
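As an illustration of this standardization, assuming the pymatgen library is available (our preprocessing applies an equivalent reduction), a lattice can be Niggli-reduced as follows; differently chosen cells of the same crystal reduce to the same canonical cell.

```python
# Sketch using pymatgen (assumed dependency, not part of the model code itself).
from pymatgen.core import Lattice

lat = Lattice.from_parameters(a=4.0, b=4.0, c=6.0,
                              alpha=90.0, beta=90.0, gamma=120.0)
niggli = lat.get_niggli_reduced_lattice()
print(niggli.parameters)  # (a, b, c, alpha, beta, gamma) of the reduced cell
```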

2.3 Material Invariances

Material structures are characterized by several invariances arising from their physical and chemi-
cal properties. Properly accounting for these invariances is essential for developing machine learn-
ing models that can accurately predict material properties and generate new material structures
(Bronstein and Veličković, 2021). The key invariances include:

1. Permutation Invariance: Exchanging the indices of any pair of atoms of the same element
does not alter the material’s structure or properties.

2. Translation Invariance: Translating the entire material structure by an arbitrary vector in


space does not change the material, as only the relative positions of atoms matter.

3. Rotation Invariance: Rotating the material structure, including both the atomic coordinates
and the lattice vectors, by any rotation matrix leaves the material unchanged.

4. Periodic Invariance: There are infinitely many ways to choose unit cells with different
shapes and sizes (e.g., obtaining a larger unit cell by integer multiples of lattice vectors).

Different choices of unit cells represent the same material due to the periodicity of the crystal
structure.

Accounting for these invariances ensures that the machine learning models learn meaning-
ful and physically relevant features, leading to more accurate predictions and generation of material
structures (Zeni and Xie, 2023).

2.4 Machine Learning Models for Material Representation

Machine learning models designed for materials must handle the complexities and invariances dis-
cussed above. In our work, we leverage advanced neural network architectures such as DimeNet++
(Gasteiger and Günnemann, 2020) and GemNet (Gasteiger and Günnemann, 2021), which are ca-
pable of capturing the geometric and chemical intricacies of materials.
These models process the graph representation of materials, incorporating information
about atomic species, positions, and interactions to learn rich representations that can be used for
energy prediction; in practice, we use the versions of these architectures adapted to materials in CDVAE
(Xie and Jaakkola, 2021).

2.5 Variational Autoencoders

Variational autoencoders (VAEs) provide a framework for learning compressed latent representa-
tions of complex data distributions. For material structures, a VAE consists of an encoder Eϕ that
maps material features M = (A, X, L) to a distribution over latent variables z, and a decoder Dξ
that reconstructs the material from samples of this distribution (Kingma and Welling, 2013):

qϕ (z|M ) = N (µϕ (M ), σϕ2 (M )) (5)

pξ(M|z) = ∏_{i=1}^{N} pξ(mi|z), mi ∈ M (6)

The VAE is trained by maximizing the evidence lower bound (ELBO):

LVAE = Eqϕ (z|M ) [log pξ (M |z)] − βDKL (qϕ (z|M )||p(z)) (7)

where β is the beta term that balances reconstruction quality against regularization towards
the prior p(z) = N (0, I).
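A minimal PyTorch sketch of this objective (illustrative only; the reconstruction term is a placeholder for the structure losses defined later, and the β value shown is an assumption):

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I), keeping sampling differentiable."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(recon_loss, mu, logvar, beta=0.01):
    """Negative ELBO of Eq. (7): reconstruction term plus beta-weighted KL
    divergence between N(mu, sigma^2) and the standard normal prior."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return recon_loss + beta * kl.mean()
```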

2.6 Rectified Flow Diffusion Process

Our material generation approach employs rectified flow diffusion (Liu et al., 2022; Albergo and
Vanden-Eijnden, 2022; Lipman et al., 2023), which defines a straight path between data and noise
distributions in latent space. This formulation has proven pivotal for diffusion on high-complexity
data such as images in Stable Diffusion 3 (Esser et al., 2024). The forward
process can be described by:

zt = (1 − t)z0 + tϵ, ϵ ∼ N (0, I) (8)

where t ∈ [0, 1] is the diffusion time. This process follows the ordinary differential equa-
tion:

dzt/dt = vθ(zt, t) (9)

During training, we optimize the velocity field vθ by minimizing (Liu et al., 2022):

L = Et,ϵ [ w(t) ∥ϵθ(zt, t) − ϵ∥² ] (10)

where w(t) is a weighting function and ϵθ predicts the noise components. For generation,
we employ an Euler discretization scheme:

zt−1 = zt − αt vθ (zt , t) + σt ξt (11)

where αt is the step size, σt is a noise scale, and ξt ∼ N (0, I) is random noise. This
formulation allows efficient sampling while maintaining the geometric constraints required for
material generation.
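The forward interpolation of Eq. (8) and a single Euler update of Eq. (11) can be sketched as follows (illustrative PyTorch, not the exact training or sampling code):

```python
import torch

def rectified_flow_noise(z0, t):
    """Forward process of Eq. (8): straight-line interpolation between data z0
    and Gaussian noise at time t in [0, 1]."""
    eps = torch.randn_like(z0)
    zt = (1.0 - t) * z0 + t * eps
    return zt, eps

def euler_step(zt, v_pred, dt, noise_scale=0.0):
    """One Euler update of Eq. (11); noise_scale = 0 gives a deterministic step."""
    return zt - dt * v_pred + noise_scale * torch.randn_like(zt)
```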

3 Method

3.0.1 DimeNet++ Encoder Architecture

The encoder maps material structures to latent representations using a DimeNet++ neural network
architecture (Gasteiger and Günnemann, 2020), specifically adapted for crystalline materials. This network processes both local
and global structural information while handling periodic boundary conditions (PBCs).

Graph Construction and Feature Processing The network first converts fractional coordinates
to Cartesian coordinates and constructs a periodic graph representation by identifying neighboring
atoms within a cutoff radius (Schütt et al., 2017). Geometric features are then extracted, including
interatomic distances and angles, while accounting for periodic boundary conditions (Klicpera
et al., 2020).

Feature Embedding The network employs specialized embedding functions for geometric and
atomic features (Unke and Meuwly, 2019; Behler and Parrinello, 2007). It is worth noting that
the DimeNet++ encoder and GemNet decoder share similar fundamental components for processing
geometric features of crystal structures, namely the feature embeddings. While these architectures
contain many sophisticated elements, the feature embedding process serves as a crucial foundation
for both the encoder and decoder GNNs. A simplified version of the feature embeddings employed
in these networks is given below:

• Radial Basis Functions (RBF): Interatomic distances are expanded using:

RBF(dij) = {ϕn(dij)}_{n=1}^{Nr} (12)

where ϕn are predefined basis functions (a minimal numerical sketch of such an expansion follows this list)

• Spherical Basis Functions (SBF): Angular information is encoded using:

SBF(dij , djk , θijk ) = {ϕn (dij )ϕn (djk )Ylm (θijk )}n,l,m (13)

where Ylm (θ) are spherical harmonics
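As a minimal numerical sketch of such an expansion, the snippet below expands interatomic distances into Gaussian radial bases; this is a simplified stand-in for the basis functions ϕn of Eq. (12) (DimeNet++ itself uses spherical Bessel bases):

```python
import torch

def gaussian_rbf(distances, cutoff=6.0, num_rbf=16):
    """Expand interatomic distances into num_rbf Gaussian basis values."""
    centers = torch.linspace(0.0, cutoff, num_rbf)   # evenly spaced basis centers
    width = (cutoff / num_rbf) ** 2                  # shared Gaussian width
    d = distances.unsqueeze(-1)                      # (..., 1)
    return torch.exp(-((d - centers) ** 2) / width)  # (..., num_rbf)

feats = gaussian_rbf(torch.tensor([1.2, 2.5, 3.9]))  # shape (3, 16)
```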

Message Passing and Interactions Information is processed in DimeNet++ through interaction
blocks that update atomic representations (Gasteiger and Günnemann, 2020):

eij = fedge(hi^(l), hj^(l), RBF(dij)) (14)

mi = Σ_{j∈N(i)} Σ_{k∈N(j)} fmessage(eij, ejk, SBF(dij, djk, θijk)) (15)

hi^(l+1) = hi^(l) + fupdate(mi) (16)

This process captures both two-body (distance) and three-body (angle) interactions through
multiple layers of message passing. The final output is aggregated through pooling operations
to produce a global representation h ∈ RB×D , where B is the batch size and D is the hidden
dimension (256).

3.0.2 Variational Encoding of Crystal Structures

After obtaining the hidden representation h from the encoder network (e.g., DimeNet++), we
employ a variational approach to encode crystal structures into a latent space (Kingma and Welling,
2013). The hidden representation is transformed into a latent encoding z through:

z = µ + σ ⊙ ϵ, ϵ ∼ N (0, I) (17)

where µ = fµ (h) and log σ 2 = fvar (h) are learned transformations of the hidden representation,
and ⊙ denotes element-wise multiplication.
The latent representation z is then transformed through a series of neural networks and
functions to obtain the crystal components:

Natoms = fnum atoms (z), {L, θ} = flattice (z),

Za = fcomposition (z, Natoms ), Zx = fcoords (z, Natoms )

Here, Natoms is the predicted number of atoms per crystal, L represents the lattice lengths, θ
denotes the lattice angles, Za are the atomic composition latents, and Zx are the atomic coordinate
latents. The functions fnum atoms, flattice, fcomposition, and fcoords are broader functions composed
partially of neural networks, parameterized to predict these components from the latent vector z.
The lattice parameters are further processed through an inverse scaling transformation to
ensure physical validity, with length parameters specifically scaled according to the number of
atoms in the structure. This approach enables backpropagation through the sampling process while
maintaining the stochastic nature of the variational autoencoder.
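Schematically, these latent heads can be read as small networks applied to z; the sketch below is illustrative only (layer sizes and single-linear heads are assumptions, not the exact implementation):

```python
import torch
import torch.nn as nn

class LatentHeads(nn.Module):
    """Illustrative heads mapping the latent z to crystal components."""
    def __init__(self, latent_dim=256, max_atoms=20, num_elements=100):
        super().__init__()
        self.num_atoms = nn.Linear(latent_dim, max_atoms)        # logits over N_atoms
        self.lattice = nn.Linear(latent_dim, 6)                  # (a, b, c, alpha, beta, gamma)
        self.composition = nn.Linear(latent_dim, num_elements)   # composition latent Z_a
        self.coords = nn.Linear(latent_dim, 3)                   # coordinate latent seed for Z_x

    def forward(self, z):
        return (self.num_atoms(z), self.lattice(z),
                self.composition(z), self.coords(z))
```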

3.1 GemNet Decoder and Loss Functions

The decoder employs a modified GemNet architecture (Gasteiger and Günnemann, 2021) to recon-
struct crystal structures from the latent representation. Operating on the sampled latent variables
(z, Zx , Za ) as well as the predicted lengths and angles {L, θ} (which are directly sampled from the
VAE), the decoder first processes atomic positions through a series of message-passing operations:

h, xcart = fGemNet (z, Zx , Za , Natoms , L, θ) (18)

Message Passing and Interaction Blocks The GemNet decoder processes the atomic embed-
dings h and edge messages m through a series of interaction blocks to capture complex atomic
interactions and geometric configurations. At each layer l, the embeddings are updated using the

interaction function:

h^(l+1), m^(l+1) = fint^(l)(h^(l), m^(l), RBF, CBF, edges) (19)

After updating the embeddings, the outputs at each layer are computed using the output function:

E^(l+1), F^(l+1) = fout^(l+1)(h^(l+1), m^(l+1), RBF) (20)

Here:

• h^(l) and m^(l) are the atomic embeddings and edge messages at layer l

• fint^(l) is the interaction function at layer l, which updates the embeddings based on the previous
layer's embeddings, radial basis functions (RBF), circular basis functions (CBF), and edge
information

• fout^(l+1) is the output function at layer l + 1, which computes the energy contributions E^(l+1)
and force components F^(l+1) from the updated embeddings

• edges contains the edge indices and associated geometric information necessary for message
passing

The final atom type predictions are obtained through:

P (Z) = softmax(ffc (h)) (21)

The training objective combines multiple loss terms:

LVAE = λ1 Lcoord + λ2 Ltype + λ3 Llattice + λ4 Lnum + λ5 LKL

where:

Lcoord = ∥xcart − ffrac→cart(xfrac^gt, L, θ)∥²₂

Ltype = CrossEntropy(P(Z), Z^gt − 1)

Llattice = ∥ξ(L, θ) − ξ(L^gt / Natoms^{1/3}, θ^gt)∥²₂

Lnum = CrossEntropy(P(Natoms), Natoms^gt)

LKL = Eqϕ(z|x)[log qϕ(z|x) − log p(z)]

Here, ξ represents the lattice normalization function, and the lattice lengths are specifically
scaled by Natoms^{1/3} to account for crystal size (Zeni and Xie, 2023). This scaling ensures that
larger crystals have appropriately larger lattice dimensions. The coordinate loss Lcoord operates
in Cartesian space, comparing the predicted Cartesian coordinates xcart to the ground-truth coordinates
obtained by transforming the ground-truth fractional coordinates xfrac^gt using the predicted
lattice parameters. (The superscript "gt" denotes "ground truth", used to assist training via teacher
forcing if necessary.) The type loss Ltype employs cross-entropy loss between the predicted atomic types
P(Z) and the ground-truth atomic numbers shifted by one (Z^gt − 1) to account for zero-based
indexing. The λ terms in the overall loss serve as weighting factors, ensuring proper
balance among the different loss components and facilitating effective learning in both the encoder
and decoder neural networks. The KL divergence loss LKL (Kingma and Welling, 2013) regularizes
the latent space by encouraging the learned latent distribution to match a standard normal prior,
enabling smooth sampling and interpolation of crystal structures.
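The weighted combination of these terms can be sketched as follows (illustrative PyTorch; the dictionary keys are hypothetical names for the predicted and ground-truth quantities, not the exact training interface):

```python
import torch
import torch.nn.functional as F

def combined_vae_loss(pred, target, weights):
    """Weighted sum of the loss terms defined above (weights are the lambdas)."""
    l_coord = F.mse_loss(pred["cart_coords"], target["cart_coords"])
    l_type = F.cross_entropy(pred["type_logits"], target["atomic_numbers"] - 1)
    l_lattice = F.mse_loss(pred["lattice"], target["scaled_lattice"])
    l_num = F.cross_entropy(pred["num_atoms_logits"], target["num_atoms"])
    l_kl = pred["kl"]  # precomputed KL divergence of the encoder distribution
    return (weights["coord"] * l_coord + weights["type"] * l_type +
            weights["lattice"] * l_lattice + weights["num"] * l_num +
            weights["kl"] * l_kl)
```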

3.2 Rectified Flow Diffusion Process

After training the variational autoencoder, we implement a diffusion model in the learned latent
space using a rectified flow formulation (Liu et al., 2022). This approach enables smooth interpolation be-

tween the prior distribution and the data distribution while maintaining the physical constraints
learned by the VAE.
The forward process gradually adds noise to the atomic and coordinate components of the
latent representation {Za , Zx } using a continuous-time formulation. Early experiments showed
that applying diffusion to the latent space of angles and lengths hindered learning. Thus, we focus
the diffusion only on {Za , Zx }, with lattice parameters directly learned and sampled from the VAE,
not processed in the diffusion pipeline.
Given the encoded latent variables {Za , Zx }, we sample a noise level u ∼ N (0, 1) and
compute the weighting:
σ(t) = sigmoid(u), t = σ(t) · T

where T is the total number of diffusion steps. This weighting is based on the approach described
in Stable Diffusion 3 (Esser et al., 2024), focusing sampling around the midpoint of the forward
process.
For each latent component, noise is added as follows:

Z̃a = σ(t) · ϵa + (1 − σ(t)) · Za , ϵa ∼ N (0, I) (22)

Z̃x = σ(t) · ϵx + (1 − σ(t)) · Zx , ϵx ∼ N (0, I) (23)

where Zi ∈ {Za , Zx } represents each latent component and Z̃i is its noised version. For atomic
coordinates Zx , the noise weighting is adjusted to maintain consistency across atoms:

σx (t) = repeat(σ(t), Natoms )

The neural network ϵθ then predicts the added noise given the noised latents and the

timestep:

ϵ̂a = ϵθ (Z̃a , t) (24)

ϵ̂x = ϵθ (Z̃x , t) (25)

The loss function for training is defined as the mean squared error between the true noise
and the predicted noise, with λ terms serving as weighting factors for the diffusion training process:

Ldiffusion = λa ∥ϵ̂a − ϵa ∥22 + λx ∥ϵ̂x − ϵx ∥22

This training procedure enables the model to predict noise added at various noise levels,
effectively learning to denoise the latent variables while maintaining consistency across atomic
structures within each crystal.
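A compact sketch of this noising step (illustrative PyTorch; the tensor shapes are assumptions, with one row per crystal in Za and one row per atom in Zx):

```python
import torch

def add_noise(z_a, z_x, num_atoms):
    """Sample a logit-normal noise level per crystal and noise the latents as in
    Eqs. (22)-(23); sigma is broadcast per atom for the coordinate latents."""
    u = torch.randn(z_a.shape[0], 1)                        # one level per crystal
    sigma = torch.sigmoid(u)                                # logit-normal weighting
    eps_a = torch.randn_like(z_a)
    eps_x = torch.randn_like(z_x)
    sigma_x = sigma.repeat_interleave(num_atoms, dim=0)     # per-atom broadcast
    z_a_noisy = sigma * eps_a + (1 - sigma) * z_a
    z_x_noisy = sigma_x * eps_x + (1 - sigma_x) * z_x
    return z_a_noisy, z_x_noisy, eps_a, eps_x, sigma
```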

3.2.1 Sampling Method

During sampling, we start by sampling a hidden representation z from the prior distribution:

z ∼ N (0, I)

We then use the lattice function we formulated previously to obtain the lattice parameters
(lengths and angles) from z:
{L, θ} = flattice (z)

We initialize the composition and coordinate latents with pure noise:

Z̃Ta ∼ N (0, I), Z̃Tx ∼ N (0, I)

We then iteratively apply the reverse diffusion process using the learned noise predictor ϵθ .

At each timestep t, we update the latents Z̃a and Z̃x as follows (following the Flow Match Euler Discrete scheduler of the Hugging Face Diffusers library):

Z̃i^{t−1} = Z̃i^t + [(σ(t − 1) − σ(t)) / σ(t)] · ϵθ(Z̃i^t, t), for i ∈ {a, x} (26)

After completing all timesteps, we obtain the denoised latents Za = Z̃0a and Zx = Z̃0x . These are
then passed through the decoder, along with the lattice parameters, to generate the final crystal
structures.
This sampling method ensures that the generated structures are consistent with the learned
data distribution and leverages both the VAE’s generative capabilities and the diffusion model’s
refinement process.
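The reverse loop can be sketched as follows (illustrative; the interface of the noise predictor eps_net and the decreasing sigma schedule are assumptions consistent with Eq. (26), not the exact implementation):

```python
import torch

@torch.no_grad()
def sample_latents(eps_net, z, lattice, num_atoms, z_a, z_x, sigmas):
    """Reverse update of Eq. (26) over a decreasing noise schedule sigmas
    (sigmas[0] ~ 1, sigmas[-1] ~ 0); z_a, z_x start as pure Gaussian noise."""
    for i, sigma_t in enumerate(sigmas[:-1]):
        sigma_next = sigmas[i + 1]
        eps_a, eps_x = eps_net(z, z_a, z_x, num_atoms, lattice, t=sigma_t)
        scale = (sigma_next - sigma_t) / sigma_t
        z_a = z_a + scale * eps_a
        z_x = z_x + scale * eps_x
    return z_a, z_x
```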

3.2.2 Denoising Neural Network Architecture

In our approach, the noise prediction neural network ϵθ is based on the GemNet architecture
(Gasteiger and Günnemann, 2021), which we have adapted to include learnable timestep embed-
dings. This adaptation allows the network to condition its predictions on the current timestep t,
effectively capturing the time-dependent nature of the noise in the diffusion process.
The timestep embedding is incorporated into the network by encoding t using sinusoidal
functions to create a high-dimensional vector e(t):

e(t) = [sin(ωk t), cos(ωk t)]_{k=1}^{d/2}

as in Vaswani et al. (2017), where ωk are predetermined frequencies. This embedding e(t)
is then integrated with the atom embeddings through learned MLPs in the network. This enables
the model to adapt its noise predictions according to the noise level at each timestep, similar to
techniques used in previous diffusion models (Ho et al., 2020).
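A minimal sketch of this sinusoidal embedding (illustrative PyTorch; the dimension and maximum period are assumed values):

```python
import math
import torch

def timestep_embedding(t, dim=128, max_period=10000.0):
    """Sinusoidal embedding e(t) with d/2 sine and d/2 cosine channels,
    mirroring the Transformer positional encoding."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    angles = t.float().unsqueeze(-1) * freqs          # (batch, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 50, 99]))   # shape (3, 128)
```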

4 Implementation

We utilized a dataset of 1.46 million materials from the Materials Project and Alexandria datasets,
preprocessed and filtered for quality. Specifically, we curated materials with formation energies
less than 0.1 eV above hull to ensure thermodynamic stability, and included only structures with
complete crystallographic information including atomic positions, species, and lattice parameters.
This comprehensive dataset represents a significant advancement over previous approaches like
CDVAE, which used only ~20,000 materials, enabling our model to learn more complex crystallographic
patterns and chemical relationships.
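The filtering logic can be sketched as follows (illustrative Python; the record fields "energy_above_hull" and "structure" are assumed names, not the exact schema of either database):

```python
def filter_entries(raw_entries, max_e_above_hull=0.1, max_atoms=20):
    """Keep only thermodynamically stable, small, fully specified structures.

    Each entry is assumed to be a dict with an "energy_above_hull" value (eV/atom)
    and a pymatgen-like "structure" object; these field names are illustrative.
    """
    kept = []
    for entry in raw_entries:
        structure = entry.get("structure")
        if structure is None:
            continue  # incomplete crystallographic information
        if entry.get("energy_above_hull", float("inf")) >= max_e_above_hull:
            continue  # not within the stability threshold
        if len(structure) > max_atoms:
            continue  # more sites in the unit cell than we train on
        kept.append(entry)
    return kept
```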

4.1 Universal Crystalline Representation of Materials

Figure 1 outlines how the input crystal data is transformed throughout
the two-stage training process. Note that three distinct algorithms are indicated by
this diagram: first, the VAE training phase; second, the rectified flow diffusion training phase;
and finally, the sampling process used to generate thermodynamically stable materials once trained.
Our approach builds upon the universal representation of materials shown in the input crystal box
(Butler et al., 2018) of Fig. 1, where a material M = (A, X, L) consists of atomic species A,
fractional coordinates X, and lattice matrix L. For continuity, we transform the fractional
coordinates to Cartesian coordinates via X̃ = LX and X = L⁻¹X̃ (Giacovazzo, 2002)
(where X̃ denotes the Cartesian coordinates) throughout the process, instilling inductive biases in the model.

4.2 Crystalline Adapted Variational Autoencoder

Referring to Algorithm 1, the VAE architecture combines an adapted DimeNet++ encoder (Gasteiger
and Günnemann, 2020) and GemNet (Gasteiger and Günnemann, 2021) decoder, connected through
a structured latent space (as illustrated by Fig. 1).
The encoder processes crystal structures through message passing operations shown in the encoder
box in the diagram, producing mean and variance parameters via MLPs for the VAE sampling step.

[Figure 1 diagram; component labels: Input Crystal, Adapted DimeNet++ Encoder, VAE Sampling, Latent Space, Adapted GemNet Decoder, VAE Loss Function, Rectified Flow, Noise Prediction Network, Time Conditioning, Diffusion Loss Function.]

Figure 1: Overview of the Crystal Structure Generation Pipeline, showing the Variational Autoencoder (VAE) components and Latent Diffusion Process.

The sampled latent vector z is transformed by specialized MLP functions into crystal components
in the latent space: number of atoms Natoms, lattice parameters {L, θ}, and latent variables
for composition Za and coordinates Zx. Note that diffusion is not conducted on the lattice
of materials: early experiments demonstrated unstable training with that approach, and much
more stable convergence when a simple "separate" decoder is used for the lattice parameters
(and for Natoms). This is intuitive, since providing the lattice directly gives the rectified flow diffusion
model a more direct path to the form of a thermodynamically stable crystal. The decoder
processes the relevant latent variables as well as the predicted lattice and atom count (or their
ground-truth values when teacher forcing). The adapted GemNet decoder utilizes message passing
and interaction blocks to process atomic embeddings through multiple layers, capturing complex
atomic interactions.

Algorithm 1 VAE Training for Crystal Structure Generation
Require: Material dataset D, learning rate η, batch size B
Ensure: Trained VAE model parameters ϕ, θ, outputs A, X
1: Initialize encoder Eϕ (DimeNet++) and decoder Dθ (GemNet)
2: while not converged do
3:   Sample minibatch (A, X, L)B ∼ D
4:   h ← Eϕ(A, X, L)
5:   µ, σ ← fµ(h), fvar(h)
6:   ϵ ∼ N(0, I)
7:   z ← µ + σ ⊙ ϵ
8:   Natoms ← fnum atoms(z)
9:   {L, θ} ← flattice(z)
10:  Za, Zx ← fcomp(z), fcoords(z)
11:  Â, X̂ ← Dθ(z, Za, Zx, Natoms, L, θ)
12:  Compute LVAE using all loss terms
13:  Update ϕ, θ using ∇ϕ,θ LVAE and η
14: end while

Algorithm 2 Rectified Flow Diffusion Training
Require: Trained VAE, latent samples {Za, Zx}, timesteps T, hidden state z, lattice params {L, θ}
Ensure: Trained noise prediction network ϵθ
1: while not converged do
2:   Sample u ∼ N(0, 1)
3:   σ(t) ← sigmoid(u)
4:   t ← σ(t) · T
5:   Sample ϵa, ϵx ∼ N(0, I)
6:   Z̃a ← σ(t)ϵa + (1 − σ(t))Za
7:   Z̃x ← σ(t)ϵx + (1 − σ(t))Zx
8:   σx(t) ← repeat(σ(t), Natoms)
9:   ϵ̂a, ϵ̂x ← ϵθ(z, Z̃a, Z̃x, Natoms, L, θ, embed(t))
10:  Ldiffusion ← λa MSE(ϵ̂a, ϵa) + λx MSE(ϵ̂x, ϵx)
11:  Update θ using gradient of Ldiffusion
12: end while

Figure 2: Two-stage training process for crystal structure generation. Left: VAE training phase for
learning latent representations of crystal structures. Right: Rectified flow diffusion training phase
for improving generation quality.

4.3 Rectified Flow Diffusion Process

Referring to Algorithm 2, the diffusion process operates in the latent space exclusively on {Za , Zx }
using the rectified flow formulation shown in the Rectified Flow box in the diagram. The noise
schedule uses a sigmoid-transformed Gaussian sample to compute noise levels based on the logit-
normal sampling distribution described in Stable Diffusion 3 (Esser et al., 2024).
The noise prediction network, as seen in the diagram, conditions on the lattice (lengths and angles)
and learned sinusoidal timestep embeddings e(t) (Vaswani et al., 2017) to predict and remove
noise iteratively; this constitutes the first crystal-conditioned rectified flow diffusion model for
materials. The denoising model is also based on a time-adapted, specialized crystalline GemNet;
specifically, the learned time embedding is added directly to the initial atomic embeddings before

the message passing layers, which then propagates through the edge embeddings and interaction
blocks via the standard GemNet architecture.

4.4 Sampling from the Crystalline Pipeline

Algorithm 3 Crystal Structure Generation via Sampling
Require: Trained VAE and noise prediction network, timesteps T
Ensure: Generated crystal structure
1: Sample z ∼ N(0, I)
2: Natoms ← fnum atoms(z)
3: {L, θ} ← flattice(z)
4: Initialize noise scheduler with T timesteps
5: Initialize Z̃a^T, Z̃x^T ∼ N(0, I)
6: for t from T to 1 do
7:   Get current timestep tcurrent from scheduler
8:   ϵ̂a, ϵ̂x ← ϵθ(z, Z̃a^t, Z̃x^t, Natoms, L, θ, embed(tcurrent))
9:   Z̃a^{t−1} ← Z̃a^t + [(σ(t−1) − σ(t))/σ(t)] ϵ̂a
10:  Z̃x^{t−1} ← Z̃x^t + [(σ(t−1) − σ(t))/σ(t)] ϵ̂x
11: end for
12: Generate final structure via Dθ(z, Z̃a^0, Z̃x^0, Natoms, L, θ)

During sampling (Algorithm 3), we first sample z from the prior and decode the crystal parameters.
The diffusion process then progressively denoises the composition and coordinate latents using the
"Flow Match Euler Discrete" scheduler from the Hugging Face Diffusers library, shown in steps 9
and 10 of the algorithm, with the denoising prediction coming from the trained time-conditioned
GemNet model at each timestep. The process runs for 100 inference timesteps, where at each step
the model predicts and removes noise components ϵ̂a and ϵ̂x according to the rectified flow
formulation. Once the denoising process completes, the resulting noise-free latents Z̃a^0 and Z̃x^0
represent a novel crystal structure that can be generated by the VAE decoder. This condensed
architecture enables end-to-end training of both VAE and diffusion components while maintaining
physical constraints through the structured latent space and specialized neural architectures,
ultimately allowing for reliable generation of stable crystal structures.

5 Training and Results

To guarantee thermodynamic stability (formation energy < 0.1 eV above hull), we filtered
each entry in our collection of over 1.5 million crystal structures gathered from two main
sources: the Materials Project and Alexandria. Lattice vectors, fractional coordinates, and elemental
composition were extracted from each crystal. We then divided the
dataset into training and validation sets, taking care to preserve a variety of crystal systems and

stoichiometries. By repeatedly iterating across these structures in mini-batches during training,
our VAE+diffusion pipeline was able to encode and decode stable crystal patterns, progressively
improving the latent representation to capture bonding patterns, periodicity, and symmetry. In
addition to increasing model resilience, this extensive training made it possible for the ensuing
property predictor (MLP) to generalize to a wide variety of materials.

Figure 3: Periodic Table Distribution of Materials in Curated Dataset

This periodic table plot, generated with Pymatviz, shows the wide variety of
materials curated in the new dataset, with representation in almost every element of the
periodic table. In addition, for consistency, we filter the dataset to structures with at most
20 atoms per crystal. This ensures stable training and consistency in our model; moreover,
materials with the most manufacturing potential typically have fewer atoms per unit cell.
Below is the distribution of atoms per crystal (2-20 atoms) in our dataset.

Figure 4: Crystal Atomic Count Distribution

5.1 Property-Guided Generation

In order to channel the power of our model for discovering crystalline materials that meet specific
criteria (e.g., targeted bandgaps or stability thresholds), we introduce a property-guided generation
strategy. This approach comprises two key steps, outlined in Algorithm 4 (training a property MLP
on latent vectors) and Algorithm 5 (property-conditioned sampling). This technique was inspired
by prior work from researchers in Korea (CITE), who guided latent-space
molecular generation toward certain properties; we adopt a similar approach for
our application.

Algorithm 4 Property MLP Training
Require: Frozen VAE+diffusion model (Eϕ , Dθ , ϵθ ), Property dataset Dp
Ensure: Trained property MLP fψ
1: Initialize property MLP fψ
2: while not converged do
3: Sample batch (Mi , pi ) ∼ Dp
4: (Ai , Xi , Li ) ← Mi
5: h ← Eϕ (Ai , Xi , Li )
6: µ, σ ← fµ (h), fvar (h)
7: zi ← µ + σ ⊙ ϵ
8: p̂i ← fψ (zi )
9: Lprop ← MSE(p̂i , pi )
10: Update ψ using gradient of Lprop
11: end while

5.2 Property MLP Training Algorithm

After training the main VAE+diffusion architecture on our large dataset, material scientists may
freeze the model's weights and gather a smaller property-labeled dataset Dp targeting the specific
property of interest. For each crystal in Dp, we encode it using the VAE encoder to
obtain a latent representation z. An MLP, fψ , is then trained to predict the known property (e.g.,
bandgap) from z via a simple regression loss. This yields a lightweight “property head” that can
map any new latent vector to a predicted property value.

Algorithm 5 Property-Guided Crystal Generation
Require: Trained VAE and diffusion model, Property MLP fψ , target property ptarget , timesteps T
Ensure: Generated crystal structure with target property
1: Sample z ∼ N (0, I)
2: p̂ ← fψ (z)
3: for i = 1 to Nsteps do
4: Lprop ← (p̂ − ptarget )2
5: z ← z − α∇z Lprop
6: p̂ ← fψ (z)
7: end for
8: Natoms ← fnum atoms (z)
9: {L, θ} ← flattice (z)
10: Initialize Z̃Ta , Z̃Tx ∼ N (0, I)
11: for t from T to 1 do
12: Get current timestep tcurrent from scheduler
13: ϵ̂a , ϵ̂x ← ϵθ (z, Z̃ta , Z̃tx , Natoms , L, θ, embed(tcurrent ))
14: Z̃a^{t−1} ← Z̃a^t + [(σ(t−1) − σ(t))/σ(t)] ϵ̂a
15: Z̃x^{t−1} ← Z̃x^t + [(σ(t−1) − σ(t))/σ(t)] ϵ̂x
16: end for
17: Generate final structure via Dθ (z, Z̃0a , Z̃0x , Natoms , L, θ)

5.3 Property-Guided Sampling Algorithm

To generate crystals with a user-defined target property ptarget , we first sample a latent vector z
from N (0, I). We then gently adjust (or “nudge”) z by taking a few gradient steps against the
property MLP’s prediction error, steering it toward the desired ptarget . Once z is aligned, we decode
the lattice parameters and run the standard rectified flow diffusion steps on {Za , Zx } to ensure the
final structure is both thermodynamically plausible and property-oriented. This allows researchers

to focus their experimental or computational efforts on a narrower set of candidate materials al-
ready biased toward the desired characteristics, substantially reducing trial-and-error in practical
workflows.
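The latent "nudging" inner loop of Algorithm 5 can be sketched as follows (illustrative PyTorch; plain gradient steps with a fixed step size α, assuming a frozen, already-trained property MLP):

```python
import torch

def nudge_latent(z, property_mlp, p_target, steps=50, alpha=1e-2):
    """Steer the latent z toward a target property by gradient steps on the
    frozen property MLP's squared error (the inner loop of Algorithm 5)."""
    z = z.clone().requires_grad_(True)
    for _ in range(steps):
        loss = (property_mlp(z) - p_target).pow(2).mean()
        grad, = torch.autograd.grad(loss, z)
        z = (z - alpha * grad).detach().requires_grad_(True)
    return z.detach()
```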

5.4 Reconstruction Quality

Reconstruction quality was evaluated using Mean Squared Error (MSE) and the Structural Similarity
Index (SSIM) between original and reconstructed crystal graphs.

Metric   Training Set   Test Set
MSE      0.0025         0.0030
SSIM     0.987          0.982

Table 1: Reconstruction performance metrics

5.5 Property Prediction

We demonstrated the ability to generate materials with targeted properties, validated against known
data. For instance, materials with a desired band gap range were generated and verified using
density functional theory (DFT) calculations.

5.6 Visualization

6 Discussion

The integration of VAE and latent space diffusion shows promise in accelerating material discov-
ery. The model provides a more efficient exploration of the material space compared to traditional
methods. By leveraging advanced neural network architectures and diffusion principles, we can
capture the intricate relationships in material structures.

6.1 Advantages

• Efficiency: Reduces the time required to find suitable material candidates.

• Scalability: Can handle large datasets and generate diverse materials.

• Flexibility: Capable of property-guided generation for specific applications.

6.2 Limitations

• Computational Resources: Training requires significant computational power.

• Data Quality: Relies on the quality of input data; erroneous data can affect performance.

7 Conclusion

This work presents a novel approach to material discovery using advanced machine learning tech-
niques. By making the code and model weights publicly available, we aim to empower material
scientists and manufacturers to utilize this tool for rapid material innovation.

Future Work

Future directions include:

• Incorporating more complex material properties into the model.

• Extending the framework to handle 3D crystal structures.

• Collaborating with experimentalists to validate generated materials.

Acknowledgments

We thank the Materials Project and Alexandria for the datasets provided. We also acknowledge the
support from AWS for computational resources.

References

Albergo, M. and Vanden-Eijnden, E. (2022). Stochastic interpolants for flow-based modeling.


arXiv preprint arXiv:2203.16126.

Bartel, C. J., Trewartha, A., Wang, Q., Dunn, A., Jain, A., and Ceder, G. (2020). A critical
examination of compound stability predictions from machine-learned formation energies. npj
Computational Materials.

Behler, J. and Parrinello, M. (2007). Generalized neural-network representation of high-


dimensional potential-energy surfaces. Physical Review Letters, 98(14):146401.

Bronstein, M. M., Bruna, J., Cohen, T., and Veličković, P. (2021). Geometric deep learning: Grids,
groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478.

Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O., and Walsh, A. (2018). Materials represen-
tation and learning. Journal of Chemical Physics.

Chen, L., Zhang, W., Nie, Z., Li, S., and Pan, F. (2021). Generative models for inverse design of
inorganic solid materials. Journal of Materials Informatics.

Curtarolo, S., Setyawan, W., Wang, S., Xue, J., Yang, K., Taylor, R. H., Nelson, L. J., Hart, G. L. W.,
Sanvito, S., Buongiorno-Nardelli, M., Mingo, N., and Levy, O. (2012). AFLOWLIB.ORG: A distributed
materials properties repository from high-throughput ab initio calculations. Computational Materials Science.

Draxl, C. and Scheffler, M. (2019). The nomad laboratory: from data sharing to artificial intelli-
gence. Journal of Physics: Materials, 2(3):036001.

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer,
A., Boesel, F., et al. (2024). Scaling rectified flow transformers for high-resolution image syn-
thesis. arXiv preprint arXiv:2403.03206.

Gasteiger, J., Becker, F., and Günnemann, S. (2021). GemNet: Universal directional graph neural
networks for molecules. In Advances in Neural Information Processing Systems, volume 34, pages
6790–6802.

Gasteiger, J., Giri, S., Margraf, J. T., and Günnemann, S. (2020). Fast and uncertainty-aware directional
message passing for non-equilibrium molecules. In Machine Learning for Molecules Workshop at
NeurIPS 2020.

Giacovazzo, C. (2002). Crystallographic Computing. Oxford University Press, Oxford.

Hahn, T., editor (1983). International Tables for Crystallography, Volume A: Space-group Symme-
try. D. Reidel Publishing Company, Dordrecht, Holland and Boston, USA.

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. In Advances in
Neural Information Processing Systems, volume 33, pages 6840–6851.

Hoffmann, J., Maestrati, L., Sawada, Y., Tang, J., Sellier, J. M., and Bengio, Y. (2019). Data-driven
approach to encoding and decoding 3-d crystal structures.

Jain, A., Ong, S. P., Hautier, G., Chen, W., Richards, W. D., Dacek, S., Cholia, S., Gunter, D.,
Skinner, D., Ceder, G., and Persson, K. A. (2013). Commentary: The materials project: A
materials genome approach to accelerating materials innovation. APL Materials.

Jain, A., Voznyy, O., and Sargent, E. H. (2017). High-throughput screening of lead-free
perovskite-like materials for optoelectronic applications. The Journal of Physical Chemistry
C, 121(13):7183–7187.

Kingma, D. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint


arXiv:1312.6114.

Klicpera, J., Groß, J., and Günnemann, S. (2020). Directional message passing for molecular
graphs. In Advances in Neural Information Processing Systems, volume 33, pages 8322–8333.

Lipman, Y. et al. (2023). Stochastic sampling and path optimization in rectified flow models. arXiv
preprint arXiv:2304.09189.

Liu, P. et al. (2022). Rectified flows for efficient generation. arXiv preprint arXiv:2204.09078.

Liu, Y., Zhao, T., Ju, W., and Shi, S. (2017). Materials discovery and design using machine
learning. Journal of Materiomics.

Pickard, C. J. and Needs, R. J. (2011). Ab initio random structure searching. Journal of Physics:
Condensed Matter.

Project, A. (2022). Alexandria: A database of theoretical crystal structures.

Saal, J. E., Kirklin, S., Aykol, M., Meredig, B., and Wolverton, C. (2013). Materials design and
discovery with high-throughput density functional theory: The open quantum materials database
(OQMD). JOM.

Schütt, K. T., Kindermans, P.-J., Sauceda, H. E., Chmiela, S., Tkatchenko, A., and Müller, K.-
R. (2017). Schnet: A continuous-filter convolutional neural network for modeling quantum
interactions. In Advances in Neural Information Processing Systems, volume 30.

Schütt, K. T., Kindermans, P.-J., Sauceda, H. E., Chmiela, S., Tkatchenko, A., and Müller, K.-
R. (2018). Schnet: A continuous-filter convolutional neural network for modeling quantum
interactions. Advances in Neural Information Processing Systems.

Sendek, A. D., Yang, Q., Cubuk, E. D., Duerloo, K.-A. N., Cui, Y., and Reed, E. J. (2017). Holistic
computational structure screening of more than 12 000 candidates for solid lithium-ion conduc-
tor materials. Energy Environ. Sci.

Unke, O. T. and Meuwly, M. (2019). Physnet: A neural network for predicting energies,
forces, dipole moments, and partial charges. Journal of Chemical Theory and Computation,
15(6):3678–3693.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and
Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing
Systems, volume 30, pages 5998–6008.

Xie, T., Fu, X., Ganea, O.-E., Barzilay, R., and Jaakkola, T. (2021). Crystal diffusion variational
autoencoder for periodic material generation. arXiv preprint arXiv:2110.06197.

Xie, T. and Grossman, J. (2018). Crystal graph convolutional neural networks for an accurate and
interpretable prediction of material properties. Physical review letters, 120(14):145301.

Zeni, C., Pinsler, R., Zügner, D., Fowler, A., Horton, M., Fu, X., et al., and Xie, T. (2023). MatterGen:
a generative model for inorganic materials design. arXiv preprint arXiv:2312.03687.

