The PLAS-HES-5k dataset is a curated collection of 5,000 protein-ligand complexes (PLCs) designed specifically for machine learning applications in drug discovery. This comprehensive dataset includes both bound and unbound conformations, annotated with binding free energies derived from non-equilibrium molecular dynamics simulations and approximate free energy calculations. The structural and energetic diversity represented in PLAS-HES-5k makes it an ideal benchmark and training resource for predictive and generative ML/DL models aimed at improving drug-target binding predictions.
The dataset contains:
- 5,000 protein-ligand complexes in two conformational variants:
  - PLAS: Bound conformations starting from the PDB database
  - HES: Synthetic unbound conformations derived from PLAS-20K, including both low- and high-energy states
- For each complex:
  - Atomic coordinates
  - Binding free energy data
  - Parameter files for simulation
Each protein-ligand complex in the dataset includes:
- Bound conformations from PLAS-20K
  - Complete atomic coordinates in PDB format
  - Parameter files (.prmtop)
  - Binding affinity calculations
- Synthetic unbound conformations seeded from PLAS-20K
  - Both low and high energy conformational states
  - Complete atomic coordinates in PDB format
  - Parameter files (.prmtop)
  - Binding affinity calculations
The dataset is distributed in the following file formats:
- `.tar.gz` archives containing collections of PLCs
- `.txt` files listing which PLCs are present in each archive
- `.pdb` files containing structural information
- `.prmtop` files containing molecular dynamics parameters
- `.csv` files with binding affinity data
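
Since the structural data ships as `.pdb` files, a small reader can help when building model inputs. The following is a minimal sketch, assuming the files follow the standard fixed-column PDB layout for `ATOM` and `HETATM` records; the `Atom` type, the `read_pdb_atoms` function, and the file name in the usage comment are illustrative and not part of the dataset tooling.

```python
from typing import Iterable, List, NamedTuple


class Atom(NamedTuple):
    """One coordinate record from a PDB file (illustrative type)."""
    name: str
    res_name: str
    x: float
    y: float
    z: float


def read_pdb_atoms(lines: Iterable[str]) -> List[Atom]:
    """Collect atoms from ATOM/HETATM records using standard PDB columns."""
    atoms = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            atoms.append(
                Atom(
                    name=line[12:16].strip(),      # columns 13-16: atom name
                    res_name=line[17:20].strip(),  # columns 18-20: residue name
                    x=float(line[30:38]),          # columns 31-38: x in angstroms
                    y=float(line[38:46]),          # columns 39-46: y in angstroms
                    z=float(line[46:54]),          # columns 47-54: z in angstroms
                )
            )
    return atoms


# Usage (hypothetical file name):
# with open("complex.pdb") as handle:
#     atoms = read_pdb_atoms(handle)
```

For production use, an established parser such as the one in Biopython or MDAnalysis is likely a better choice, particularly when the `.prmtop` topology is needed alongside the coordinates.
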
The PLAS-HES-5k dataset is designed for:
- Training ML/DL models for predicting protein-ligand binding affinities
- Evaluating generative models in drug design
- Research on conformational sampling and binding dynamics
- Benchmarking novel ML/DL architectures for drug discovery applications
- Environment Setup
  - Environment configuration files for simulations are located in the `Env` folder
  - A YAML file is provided for setting up the conda environment with Plumed2 activated
- Steered Molecular Dynamics
  - Simulation scripts are available in the `Steered_Molecular_Dynamics` folder
  - Submit jobs using the command:
    `sh Complete_simulation_setup_hs.sh "Index_Number" "PDB_ID" "cpu-number" "partition"`
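
When many complexes need to be simulated, the submission command can be generated programmatically. The sketch below only assembles and previews the command lines; it does not run anything. The argument order is taken from the command shown above, while the helper names, the example PDB IDs, the CPU count, and the partition name are placeholders chosen for illustration. The meaning of `Index_Number` is not documented here, so the caller supplies it explicitly.

```python
import shlex
from typing import Iterable, List, Tuple

SCRIPT = "Complete_simulation_setup_hs.sh"


def build_command(index_number: int, pdb_id: str,
                  cpu_number: int, partition: str) -> List[str]:
    """Return the argument list for one submission, in the documented order."""
    return ["sh", SCRIPT, str(index_number), pdb_id, str(cpu_number), partition]


def preview_commands(jobs: Iterable[Tuple[int, str]],
                     cpu_number: int, partition: str) -> List[str]:
    """Render shell-quoted command lines for a batch of (index, PDB ID) pairs."""
    return [
        shlex.join(build_command(index, pdb_id, cpu_number, partition))
        for index, pdb_id in jobs
    ]


# Example with placeholder values; print the commands before running them.
for command in preview_commands([(1, "1abc"), (2, "2xyz")],
                                cpu_number=8, partition="gpu"):
    print(command)
```

To execute a command rather than preview it, the list returned by `build_command` can be passed to `subprocess.run(..., check=True)` from inside the `Steered_Molecular_Dynamics` folder.
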
- Trajectory Validation
  - The `Trajectory_validation` folder contains scripts to validate trajectories through:
    - Sigmoidal curve fitting
    - RMSD analysis of protein and ligand
    - Center-of-mass distance separation measurements
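
To make two of the validation quantities above concrete, here is a minimal sketch of an RMSD and a center-of-mass distance calculation on plain coordinate lists. It is not the code from the `Trajectory_validation` folder: the function names are illustrative, the RMSD is computed without prior structural superposition, and unit masses are assumed unless masses are supplied.

```python
import math
from typing import Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(reference: Sequence[Coord], frame: Sequence[Coord]) -> float:
    """Root-mean-square deviation between two matched coordinate sets.

    No alignment is performed; superimpose the structures first if
    rigid-body motion should be excluded.
    """
    if len(reference) != len(frame) or not reference:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    squared = sum(
        (a - b) ** 2
        for ref_atom, frame_atom in zip(reference, frame)
        for a, b in zip(ref_atom, frame_atom)
    )
    return math.sqrt(squared / len(reference))


def center_of_mass(coords: Sequence[Coord],
                   masses: Optional[Sequence[float]] = None) -> Coord:
    """Mass-weighted mean position; unit masses are assumed if none are given."""
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    return tuple(
        sum(m * c[axis] for m, c in zip(masses, coords)) / total
        for axis in range(3)
    )


def com_distance(protein: Sequence[Coord], ligand: Sequence[Coord]) -> float:
    """Distance between the protein and ligand centers of mass."""
    return math.dist(center_of_mass(protein), center_of_mass(ligand))
```

Tracking `com_distance` frame by frame gives the protein-ligand separation over the course of a pulling simulation, which is the kind of curve a sigmoidal fit would then be applied to.
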
- Dataset Access
  - File structures are generated for each PDB ID
  - The complete dataset is publicly available on the India-Data website: https://india-data.org/dataset-details/ef3a1c5b-6ff2-49f7-ae7a-a99f69003849
  - Extract the `.tar.gz` archives to access individual PLC data
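
Extraction can be scripted with the Python standard library. The sketch below unpacks one archive and groups the extracted files by extension; the function names and the archive and directory names in the usage comment are placeholders, since the actual archive names are not listed here.

```python
import tarfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def extract_archive(archive_path, dest_dir) -> List[Path]:
    """Extract a .tar.gz archive and return the extracted files, sorted."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())


def group_by_extension(files: List[Path]) -> Dict[str, List[Path]]:
    """Group paths by suffix, e.g. '.pdb', '.prmtop', '.csv'."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in files:
        groups[path.suffix].append(path)
    return dict(groups)


# Usage (hypothetical archive and output directory names):
# files = extract_archive("plcs_part1.tar.gz", "plas_hes_5k")
# for ext, paths in group_by_extension(files).items():
#     print(ext, len(paths))
```
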
- Energy Component Analysis
  - The `Distribution_Of_Energy_Components` folder contains scripts for:
    - Reproducing energy component distributions across all PLCs
    - Analyzing energy components for individual PLCs
- Training Machine Learning Models
  - The dataset is suitable for various ML/DL approaches:
    - Graph Neural Networks
    - 3D Convolutional Networks
    - Equivariant Neural Networks
    - Attention-based models
- Benchmarking
  - Use the binding affinity data to evaluate model performance
  - Compare model predictions across both PLAS (bound) and HES (unbound) conformations
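
One way to carry out this comparison is to compute standard regression metrics separately for each subset. The sketch below uses RMSE, MAE, and the Pearson correlation coefficient; these metrics, the function names, and the `(subset, y_true, y_pred)` record layout are assumptions made for illustration, not a protocol prescribed by the dataset.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson correlation; both inputs need non-zero variance."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def evaluate_by_subset(
    records: Iterable[Tuple[str, float, float]]
) -> Dict[str, Dict[str, float]]:
    """Compute metrics per subset label (e.g. 'PLAS' or 'HES')."""
    grouped = defaultdict(lambda: ([], []))
    for subset, y_true, y_pred in records:
        grouped[subset][0].append(y_true)
        grouped[subset][1].append(y_pred)
    return {
        subset: {
            "rmse": rmse(true, pred),
            "mae": mae(true, pred),
            "pearson_r": pearson_r(true, pred),
        }
        for subset, (true, pred) in grouped.items()
    }
```

Reporting the metrics per subset rather than pooled makes it visible when a model performs well on bound conformations but degrades on the unbound ones.
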
If you use this dataset in your research, please cite: [Citation information to be provided by dataset creators after the publication of the dataset]
For questions about the dataset, contact Prathit Chatterjee (prathit.chatterjee@ihub-data.iiit.ac.in).
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Version: 1.0.0