Hybrid Uncertainty Quantification for Bioactivity Assessment

[Paper] Combining Bayesian and Evidential Uncertainty Quantification for Improved Bioactivity Modeling — https://pubs.acs.org/doi/10.1021/acs.jcim.5c01597


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Results
  5. Statistical Significance Analysis
  6. License
  7. Contact
  8. Acknowledgments
  9. Contributing
  10. Citation
  11. Documentation

About The Project

This repository accompanies the paper "Combining Bayesian and Evidential Uncertainty Quantification for Improved Bioactivity Modeling" and supports the study on hybrid uncertainty quantification (UQ) models. We introduce and benchmark two novel hybrid models, EOE (Ensemble of Evidential networks) and EMC (Evidential MC Dropout), which combine the strengths of the Bayesian and evidential learning paradigms. These models are evaluated against established UQ baselines on curated subsets of the Papyrus++ dataset for both xC50 and Kx bioactivity endpoints.

Our comprehensive evaluation includes metrics for performance (e.g., RMSE), calibration, probabilistic scoring, and decision utility (e.g., RRC-AUC). The findings highlight EOE10 as a robust and computationally efficient model, outperforming deep ensembles in several uncertainty-aware settings.

(back to top)

Getting Started

Prerequisites

We recommend using a virtual environment (e.g. conda) to manage dependencies.

  • Conda (Miniconda or Anaconda)
  • Python ≥ 3.8
  • PyTorch ≥ 2.0
  • RDKit
  • Weights & Biases (wandb)
  • scikit-learn, numpy, pandas, seaborn, matplotlib

(back to top)

Installation

To replicate this work, create and activate the Conda environment:

conda env create --file=environments/environment_linux.yml
conda activate uqdd-env
  • Choose the environment file that matches your operating system.
  • If environment_{OS}.yml gives you an error, please use the corresponding file with the "_conda" suffix.
  • If any issues arise, please open an issue following the guidelines in the CONTRIBUTING.md file.
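
As an optional sanity check inside the activated environment, you can confirm that the core dependencies listed under Prerequisites import correctly (a minimal sketch; the exact packages and versions pinned in the environment files may differ):

import torch
import sklearn
import rdkit  # imported only to confirm RDKit is available

print("torch", torch.__version__, "| scikit-learn", sklearn.__version__)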

(back to top)

Project Folder Structure

The project is organized as follows:

├── .gitignore
├── LICENSE
├── README.md
├── environments/              # Conda and pip environment definitions
│   ├── environment_linux.yml
│   ├── environment_linux_conda.yml
│   ├── environment_windows.yml
│   └── environment_windows_conda.yml
├── uqdd/                      # Source Code Directory
│   ├── config/                # Configuration files for models and data
│   ├── data/                  # Scripts and utilities for data processing
│   └── models/                # Model implementations and training scripts
├── notebooks/                 # Jupyter Notebooks for exploratory analysis and results visualization
└── scripts/                   # Automated workflows to generate visualizations and analyses

(back to top)

Usage

This package provides scripts for data processing and model training/testing. Below are the main components you will interact with.

Data Processing

Run the following script to preprocess the dataset:

python uqdd/data/data_papyrus.py

This will generate the preprocessed datasets in the data/ directory.

As an example of a full invocation of data_papyrus.py:

python uqdd/data/data_papyrus.py --activity xc50 --descriptor-protein ankh-large --descriptor-chemical ecfp2048 --split-type time --n-targets -1 --file-ext pkl --sanitize --verbose

This command preprocesses the Papyrus dataset for the xC50 activity type, using the ANKH-large protein descriptor and the ECFP2048 chemical descriptor, with a time-based split. The data is sanitized during preprocessing (--sanitize), and the output is saved in the specified file format (pkl) in the data/ directory.
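
Once preprocessing has finished, the resulting dataset can be inspected with pandas. This is a minimal sketch that assumes the output is a pickled DataFrame; the file name below is a hypothetical placeholder, so substitute the actual file written to data/:

import pandas as pd

# hypothetical file name; check data/ for the file the script actually wrote
df = pd.read_pickle("data/papyrus_xc50_time.pkl")
print(df.shape)
print(df.columns.tolist())
print(df.head())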

(back to top)

Model Architectures

Overview of UQ Model Architectures

For a high-level conceptual overview, see the Paper Mind Map page in the documentation:

  • docs page: docs/reference/mind-map.md

(back to top)

Models

The package provides several scripts to train and test different models. The main entry point is the model_parser.py script, which allows you to specify various options for data, model, and training configuration.

Model Configuration Options

The following options are available (a small scripted-runs sketch follows the list):

  • --model: Model type (pnn, ensemble, mcdropout, evidential, eoe, emc)
  • --data_name: Dataset name (papyrus, chembl, max_curated)
  • --n_targets: Number of targets to train on (-1 for all targets)
  • --activity_type: Activity type (xc50, kx)
  • --descriptor_protein: Protein descriptor type (e.g., ankh-large, ankh-small, unirep)
  • --descriptor_chemical: Chemical descriptor type (e.g., ecfp2048, ecfp1024)
  • --split_type: Split type (random, scaffold, time)
  • --ext: File extension (pkl, parquet, csv)
  • --task_type: Task type (regression, classification)
  • --wandb_project_name: Weights & Biases project name for logging
  • --ensemble_size: Ensemble size for ensemble models
  • --num_mc_samples: Number of MC samples for MC-Dropout models
  • --epochs: Number of epochs for training
  • --batch_size: Batch size for training
  • --lr: Learning rate for training
  • --seed: Random seed for reproducibility
  • --device: Device for training (cpu, cuda)
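
To script several runs (for example, sweeping over random seeds), the parser can also be driven from Python via subprocess. This is a minimal sketch, not part of the package: only the flag names come from the list above, while the chosen values and the seed sweep itself are illustrative assumptions.

import subprocess

# hypothetical sweep over random seeds for the baseline PNN model
for seed in (0, 1, 2):
    subprocess.run(
        ["python", "uqdd/models/model_parser.py",
         "--model", "pnn",
         "--data_name", "papyrus",
         "--activity_type", "xc50",
         "--seed", str(seed)],
        check=True,
    )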

PNN Models

To train and test the baseline model, use the following command:

python uqdd/models/model_parser.py --model pnn --data_name papyrus --n_targets -1 --activity_type xc50 --descriptor_protein ankh-large --descriptor_chemical ecfp2048 --split_type random --ext pkl --task_type regression --wandb_project_name pnn-test

Ensemble Models

To train and test the ensemble model, use the following command:

python uqdd/models/model_parser.py --model ensemble --ensemble_size 100 --data_name papyrus --n_targets -1 --activity_type xc50 --descriptor_protein ankh-large --descriptor_chemical ecfp2048 --split_type random --ext pkl --task_type regression --wandb_project_name ensemble-test
  

MC-Dropout Models

To train and test the MC-Dropout model, use the following command:

python uqdd/models/model_parser.py --model mcdropout --num_mc_samples 100 --data_name papyrus --n_targets -1 --activity_type xc50 --descriptor_protein ankh-large --descriptor_chemical ecfp2048 --split_type random --ext pkl --task_type regression --wandb_project_name mcdp-test

Evidential Models

To train and test the Evidential model, use the following command:

python uqdd/models/model_parser.py --model evidential --data_name papyrus --n_targets -1 --activity_type xc50 --descriptor_protein ankh-large --descriptor_chemical ecfp2048 --split_type random --ext pkl --task_type regression --wandb_project_name evidential-test

Ensemble of Evidential (EOE) Models

To train and test the Ensemble of Evidential model, use the following command:

python uqdd/models/model_parser.py --model eoe --ensemble_size 10 --data_name papyrus --n_targets -1 --activity_type xc50 --descriptor_protein ankh-large --descriptor_chemical ecfp2048 --split_type random --ext pkl --task_type regression --wandb_project_name eoe-test

Evidential MC-Dropout (EMC) Models

To train and test the Evidential MC-Dropout model, use the following command:

python uqdd/models/model_parser.py --model emc --num_mc_samples 10 --data_name papyrus --n_targets -1 --activity_type xc50 --descriptor_protein ankh-large --descriptor_chemical ecfp2048 --split_type random --ext pkl --task_type regression --wandb_project_name emc-test

(back to top)

Results

Figures: xC50 results barplot and Kx results barplot.

(back to top)

Statistical Significance Analysis

This project provides comprehensive statistical significance analysis for model performance and uncertainty quantification. The main function, analyze_significance, performs the following steps automatically:

  • Normality diagnostics: Uses the Shapiro-Wilk test to assess whether metric residuals are normally distributed for each data split.
  • Parametric and non-parametric tests: If normality is met, parametric tests (RM-ANOVA + Tukey HSD) are available; otherwise, non-parametric tests (Friedman test with Nemenyi post-hoc analysis) are performed by default.
  • Pairwise comparisons: Includes Wilcoxon signed-rank tests and Cliff's Delta effect size for model pairs.
  • Bootstrap confidence intervals: Computes confidence intervals for differences in AUC and other metrics.
  • Comprehensive visualizations: Generates boxplots, critical difference diagrams, multiple-comparisons heatmaps (MCS plots), and confidence-interval forest plots for pairwise differences.

All results and plots are saved in the output directory (e.g., figures/{data}/{activity}/all/{project}/). These analyses help determine whether observed differences between models are statistically and practically significant.
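
For orientation, the non-parametric core of this workflow can be illustrated with SciPy. The sketch below is not the repository's analyze_significance implementation: the synthetic scores, array shapes, and the hand-rolled Cliff's delta are illustrative assumptions; only the choice of tests mirrors the steps listed above.

import numpy as np
from scipy.stats import friedmanchisquare, shapiro, wilcoxon

# synthetic per-split RMSE scores: one array of 10 splits per model
rng = np.random.default_rng(0)
scores = {
    "pnn": rng.normal(0.80, 0.05, 10),
    "ensemble": rng.normal(0.75, 0.05, 10),
    "eoe10": rng.normal(0.72, 0.05, 10),
}

# 1) Normality diagnostics per model (Shapiro-Wilk)
for name, vals in scores.items():
    print(name, "Shapiro p =", shapiro(vals).pvalue)

# 2) Omnibus non-parametric test across all models (Friedman)
stat, p = friedmanchisquare(*scores.values())
print("Friedman p =", p)

# 3) Pairwise comparison: Wilcoxon signed-rank test + Cliff's delta effect size
def cliffs_delta(a, b):
    a, b = np.asarray(a), np.asarray(b)
    greater = sum((x > b).sum() for x in a)
    less = sum((x < b).sum() for x in a)
    return (greater - less) / (len(a) * len(b))

res = wilcoxon(scores["ensemble"], scores["eoe10"])
print("Wilcoxon p =", res.pvalue,
      "| Cliff's delta =", cliffs_delta(scores["ensemble"], scores["eoe10"]))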

(back to top)

License

This project is licensed under the MIT License - see the LICENSE file for details.

(back to top)

Contact

Bola Khalil @bola-khalil - b.a.a.khalil@lacdr.leidenuniv.nl

(back to top)

Acknowledgments

B.K. acknowledges funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 955879 (DRUGTrain). B.K., N.D., and H.V.V. are affiliated with Janssen Pharmaceutica, a Johnson & Johnson company.

K.S. acknowledges funding from the ELLIS Unit Linz. The ELLIS Unit Linz, the LIT AI Lab, and the Institute for Machine Learning are supported by the Federal State of Upper Austria.

(back to top)

Contributing

We welcome contributions! Please see our CONTRIBUTING.md file for guidelines on how to contribute to this project.

(back to top)

Citation

If you find this work useful, please cite the following paper:

@article{Khalil2025,
  author  = {Khalil, Bola and Schweighofer, Kajetan and Dyubankova, Natalia and van Westen, Gerard J. P. and van Vlijmen, Herman},
  title   = {{Combining Bayesian and Evidential Uncertainty Quantification for Improved Bioactivity Modeling}},
  journal = {Journal of Chemical Information and Modeling},
  year    = {2025},
  issn    = {1549-9596},
  doi     = {10.1021/acs.jcim.5c01597}
}

Documentation

A full documentation site (built with MkDocs Material) is available and will be published via GitHub Pages. It includes a getting-started guide, a user guide, an API reference, and a mind map summarizing the paper.

(back to top)
