Code repository for "Targeting protein-ligand neosurfaces using a generalizable deep learning approach".
- Description
- Method overview
- System requirements
- Installation with Docker
- Preprocess a PDB file
- Computational binder recovery benchmark
- Running a seed search
- Running a seed refinement and grafting
- License
- Reference
Molecular recognition events between proteins drive biological processes in living systems. However, higher levels of mechanistic regulation have emerged, where protein-protein interactions are conditioned to small molecules. Here, we present a computational strategy for the design of proteins that target neosurfaces, i.e. surfaces arising from protein-ligand complexes. To do so, we leveraged a deep learning approach based on learned molecular surface representations and experimentally validated binders against three drug-bound protein complexes. Remarkably, surface fingerprints trained only on proteins can be applied to neosurfaces emerging from small molecules, serving as a powerful demonstration of generalizability that is uncommon in deep learning approaches. The designed chemically-induced protein interactions hold the potential to expand the sensing repertoire and the assembly of new synthetic pathways in engineered cells.
MaSIF-seed has been tested on Linux, and we recommend running it in an x86-based Linux Docker container. It can also run in an Apple M1 environment, but much more slowly. Reproducing the experiments in the paper requires several terabytes of storage for the full datasets of all proteins.
Currently, MaSIF takes a few seconds to preprocess each protein. We find the main bottleneck to be the APBS computation of surface charges, which can likely be optimized. Nevertheless, for large protein datasets we recommend preprocessing the data on a distributed cluster.
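As a rough illustration of preprocessing many proteins in parallel on a single machine, the sketch below fans out calls to the repository's preprocess_pdb.sh script (described in the preprocessing section) over a small worker pool. The helper names `build_command` and `preprocess_many`, the job tuple layout, and the default worker count are assumptions made for illustration and are not part of MaSIF. On a cluster, the same command lists could instead be handed to a job scheduler.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(pdb_file, chain_spec, out_dir, script="./preprocess_pdb.sh"):
    """Build the argument list for one preprocessing call (no ligand).

    Mirrors the documented invocation:
        ./preprocess_pdb.sh <pdb> <chains> -o <output dir>
    """
    return [script, pdb_file, chain_spec, "-o", out_dir]


def preprocess_many(jobs, out_dir, max_workers=4, runner=subprocess.run):
    """Run one preprocessing command per (pdb_file, chain_spec) job.

    Threads are sufficient because each job spends its time in an
    external process. Returns a list of (job, return code) pairs so
    that failed proteins can be retried or inspected afterwards.
    """
    commands = [build_command(pdb, chains, out_dir) for pdb, chains in jobs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(runner, commands))
    return [(job, res.returncode) for job, res in zip(jobs, results)]
```

For instance, `preprocess_many([("example/1a7x.pdb", "1A7X_A")], "example/output/")` would run the ligand-free example once. A non-zero return code marks a protein whose preprocessing failed.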
MaSIF relies on external software/libraries to handle Protein Data Bank files and surface files, to compute chemical/geometric features and coordinates, and to perform neural network calculations. The following is the list of required libraries and programs, along with the versions on which MaSIF was tested (in parentheses).
- Python (3.6)
- reduce (3.23). To add protons to proteins.
- MSMS (2.6.1). To compute the surface of proteins.
- BioPython (1.66). To parse PDB files.
- PyMesh (0.1.14). To handle ply surface files, attributes, and to regularize meshes.
- PDB2PQR (2.1.1), multivalue, and APBS (1.5). These programs are necessary to compute electrostatic charges.
- Open3D (0.5.0.0). Mainly used for RANSAC alignment.
- Tensorflow (1.9). Used to model, train, and evaluate the actual neural networks. Models were trained and evaluated on an NVIDIA Tesla K40 GPU.
- StrBioInfo. Used for parsing PDB files and generating biological assemblies for MaSIF-ligand.
- Dask (2.2.0). Runs function calls on multiple threads (optional for reproducing some benchmarks).
- Pymol (2.5.0). This optional program allows one to visualize surface files.
- RDKit (2021.9.4). For handling small molecules, especially the proton donors and acceptors.
- OpenBabel (3.1.1.7). For handling small molecules, especially the conversion into MOL2 files for APBS.
- ProDy (2.0). For handling small molecules, especially the ligand extraction from a PDB.
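When setting up an environment by hand rather than through Docker, a quick check like the one below can report which dependencies are not yet visible. This is only a sketch: the Python import names and executable names are assumptions about how the packages above are typically exposed, and it assumes the external tools are on the PATH, which may not match how a given installation locates them.

```python
import importlib.util
import shutil

# Assumed top-level import names for the Python libraries listed above.
PYTHON_MODULES = ["Bio", "pymesh", "open3d", "tensorflow",
                  "dask", "rdkit", "openbabel", "prody"]
# Assumed executable names for the external programs listed above.
EXECUTABLES = ["reduce", "msms", "pdb2pqr", "apbs", "multivalue"]


def missing_modules(names):
    """Return the module names that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_executables(names):
    """Return the program names that are not found on the PATH."""
    return [n for n in names if shutil.which(n) is None]
```

Calling `missing_modules(PYTHON_MODULES)` and `missing_executables(EXECUTABLES)` returns the items that still need attention. The check does not verify versions.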
MaSIF is written in Python and does not require compilation. Since MaSIF relies on a few external programs (MSMS, APBS) and libraries (PyMesh, Tensorflow, Scipy, Open3D), we strongly recommend using the provided Dockerfile and Docker container. Setting up the environment should take only a few minutes.
git clone https://github.com/LPDI-EPFL/masif-neosurf.git
cd masif-neosurf
docker build . -t masif-neosurf
docker run -it -v $PWD:/home/$(basename $PWD) masif-neosurf
Before we can search for complementary binding sites/seeds, we need to triangulate the molecular surface and compute the initial surface features. The script preprocess_pdb.sh takes two required positional arguments: the PDB file and a definition of the chain(s) that will be included.
If a small molecule is part of the molecular surface, we need to tell MaSIF-neosurf where to find it in the PDB file (three letter code + chain) using the -l flag. Optionally, we can also provide an SDF file with the -s flag that will be used to infer the correct connectivity information (i.e. bond types). This SDF file can be downloaded from the PDB website, for example.
Finally, we must specify an output directory with the -o flag, in which all the preprocessed files will be saved.
chmod +x ./preprocess_pdb.sh
# with ligand
./preprocess_pdb.sh example/1a7x.pdb 1A7X_A -l FKA_B -s example/1a7x_C_FKA.sdf -o example/output/
# without ligand
./preprocess_pdb.sh example/1a7x.pdb 1A7X_A -o example/output/
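To find out which value to pass to the -l flag, it can help to list the ligands a PDB file actually contains. The sketch below scans HETATM records using the fixed-column PDB layout (residue name in columns 18-20, chain identifier in column 22) and reports them in the same three-letter-code-plus-chain form. The function name `ligands_in_pdb` is hypothetical and not part of MaSIF-neosurf, and skipping waters (HOH) is a simplifying assumption.

```python
def ligands_in_pdb(lines):
    """Return the set of 'RESNAME_CHAIN' labels found in HETATM records.

    `lines` is any iterable of PDB-format lines, e.g. an open file.
    """
    found = set()
    for line in lines:
        if not line.startswith("HETATM"):
            continue
        resname = line[17:20].strip()  # columns 18-20: residue name
        chain = line[21:22].strip()    # column 22: chain identifier
        if resname and resname != "HOH":  # ignore blank names and waters
            found.add(f"{resname}_{chain}")
    return found
```

A typical use would be `with open("example/1a7x.pdb") as fh: print(ligands_in_pdb(fh))`, after which one of the reported labels can be passed to -l.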
For more details on the binder recovery benchmark, please consult the relevant README. The preprocessed dataset can be downloaded from Zenodo.
For more details on the seed search procedure, please consult the relevant README.
For more details on the seed refinement and grafting procedure, please consult the relevant README.
MaSIF-seed is released under an Apache v2.0 license.
@article{marchand2024targeting,
title={Targeting protein-ligand neosurfaces using a generalizable deep learning approach},
author={Marchand, Anthony and Buckley, Stephen and Schneuing, Arne and Pacesa, Martin and Gainza, Pablo and Elizarova, Evgenia and Neeser, Rebecca Manuela and Lee, Pao-Wan and Reymond, Luc and Elia, Maddalena and Scheller, Leo and Georgeon, Sandrine and Schmidt, Joseph and Schwaller, Philippe and Maerkl, Sebastian Josef and Bronstein, Michael and Correia, Bruno Emmanuel},
journal={bioRxiv},
pages={2024--03},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}