
Benchmarking SE(3)-based Generative Models for Protein Structure Design

Multi-GPU training supported by PyTorch Lightning ⚡

License: MIT | arXiv

Framework Overview


Supported Methods

| Name | Paper | Venue | Date | Code |
|------|-------|-------|------|------|
| FrameDiff | SE(3) diffusion model with application to protein backbone generation | ICML | 2023-04-25 | Github |
| FoldFlow | SE(3)-Stochastic Flow Matching for Protein Backbone Generation | ICLR | 2024-04-21 | Github |
| Genie1 | Genie: De Novo Protein Design by Equivariantly Diffusing Oriented Residue Clouds | ICML | 2023-06-26 | Github |
| Genie2 | Out of Many, One: Designing and Scaffolding Proteins at the Scale of the Structural Universe with Genie 2 | arXiv | 2024-05-24 | Github |
| FrameFlow | Improved motif-scaffolding with SE(3) flow matching | TMLR | 2024-07-17 | Github |
| RFdiffusion | De novo design of protein structure and function with RFdiffusion | Nature | 2023-07-11 | Github |

Installation

To get started, create a conda environment and install the dependencies with pip:

conda create -n protein-se3 python=3.9
git clone https://github.com/BruthYU/protein-se3
...
cd protein-se3
pip install -r requirements.txt

Additionally, to use RFdiffusion you need to install NVIDIA's implementation of SE(3)-Transformers. Run the script below to install it:

cd protein-se3/lightning/model/rfdiffusion/SE3Transformer
python setup.py install

Usage

In this section we will demonstrate how to use Protein-SE(3).

How to Preprocess Dataset and Build Cache


All preprocessing operations (i.e., how PDB files are mapped to the LMDB cache) are implemented in the folder protein-se3/preprocess. Please refer to its README.md for more instructions.

Protein-SE(3) featurizes proteins with the AlphaFold protein data type and builds an LMDB cache following the FoldFlow method. Different protein file formats (mmCIF, PDB, and JSONL) are unified into one data type, so the built cache can be loaded by all integrated methods during training.
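To picture what "unified into one data type" means, here is a minimal, hypothetical sketch of mapping fixed-width PDB `ATOM` records onto a single backbone record. The `BackboneRecord` fields and the `from_pdb_lines` helper are illustrative assumptions, not the actual schema used by Protein-SE(3):

```python
from dataclasses import dataclass

# Minimal three-to-one residue code map (illustrative subset).
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "LEU": "L", "SER": "S", "VAL": "V"}

@dataclass
class BackboneRecord:
    """Hypothetical unified record; field names are illustrative only."""
    name: str
    sequence: str      # one-letter amino-acid codes
    ca_coords: list    # per-residue C-alpha coordinates [x, y, z]

def from_pdb_lines(name, lines):
    """Collect C-alpha atoms from fixed-width PDB ATOM records."""
    seq, coords = [], []
    for ln in lines:
        if ln.startswith("ATOM") and ln[12:16].strip() == "CA":
            coords.append([float(ln[30:38]), float(ln[38:46]), float(ln[46:54])])
            seq.append(THREE_TO_ONE.get(ln[17:20].strip(), "X"))
    return BackboneRecord(name, "".join(seq), coords)
```

A parser for mmCIF or JSONL inputs would produce the same record type, which is what lets one cache serve every method.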

python preprocess/process_pdb_dataset.py
# Intermediate pickle files are generated.
python preprocess/build_cache.py
# Filtering configurations are listed in config.yaml; the lmdb cache is placed in preprocess/.cache.

You can also download our preprocessed dataset directly from Harvard Dataverse.

How to Run Training and Inference


Training and inference for all integrated methods are implemented in the lightning workspace (protein-se3/lightning). Please refer to its README.md for more details.
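One common way a single Lightning workspace can host several methods is a name-to-class registry that a config selects from. The sketch below is an illustrative assumption about this pattern, not the repo's actual wiring (class and key names are hypothetical):

```python
# Hypothetical registry mapping a config's method name to a model class.
MODEL_REGISTRY = {}

def register(name):
    """Class decorator that records a model under a config key."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register("framediff")
class FrameDiffWrapper:
    def __init__(self, cfg):
        self.cfg = cfg

@register("foldflow")
class FoldFlowWrapper:
    def __init__(self, cfg):
        self.cfg = cfg

def build_model(cfg):
    """Instantiate whichever method the config names (cfg is a plain dict here)."""
    return MODEL_REGISTRY[cfg["method"]](cfg)
```

With this shape, the same Trainer loop and data module can be reused across methods; only the registry key in the config changes.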

How to Evaluate Different Methods


We evaluate different protein structure design methods on two tasks: Unconditional Scaffolding and Motif Scaffolding. Please refer to README.md for more detailed information.
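Scaffolding evaluations commonly rely on alignment-based metrics such as backbone RMSD after optimal superposition. As an illustration of that kind of metric (not the benchmark's own evaluation code), here is a standard Kabsch-alignment RMSD in NumPy:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (L, 3) coordinate sets after optimal superposition:
    center both, find the best rotation via SVD (Kabsch algorithm),
    then measure the residual deviation."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections: force a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return float(np.sqrt(((P @ R.T - Q) ** 2).sum() / len(P)))
```

By construction the metric is invariant to rigid motions, so a designed backbone that differs from a reference only by rotation and translation scores (near) zero.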


Benchmark Results

Unconditional Scaffolding across Varying Lengths

Motif Scaffolding on Design24

Secondary Structure Analysis
