Machine-learning benchmark and workflow for family 1 glycosyltransferase (GT1)–acceptor reactivity prediction under novelty-controlled evaluation.
This repository accompanies the MSc thesis "Predicting Glycosyltransferase Acceptor Specificity with Variational Autoencoders and Pretrained Protein–Small-Molecule Representations". It contains a notebook-first pipeline for dataset harmonization, feature generation, novelty-controlled benchmarking, and model evaluation across three model families: pooled-feature baselines, token-level cross-attention models, and VAE-based fusion models.
In the thesis benchmark, pooled-feature XGBoost performed best in most settings, while the early-fusion supervised VAE performed best under the strictest double-cold regime, in which both the enzyme and the substrate are novel.
FRAPPUCCINO stands for:
Family 1 glycosyltransferase (GT1)
Reactivity and
Acceptor-Pair
Prediction with
Pretrained protein–small-molecule representations,
Using
Cross-modal
Compression and
Inference under
Novelty-controlled
Out-of-distribution evaluation.
- notebooks/: end-to-end Colab workflow
- data/: input data and dataset documentation
- helpers/: reusable utility code used by the notebook
- models/: saved model artifacts, configs, or checkpoints (if applicable)
- reports/: figures, tables, and exported evaluation outputs (if applicable)
- Open notebooks/.
- Run the main notebook.
- Follow the notebook cells in order to reproduce preprocessing, feature generation, benchmark construction, training, and evaluation.
The notebook is the current reference implementation of the project workflow.
This repository focuses on binary GT1 enzyme–acceptor reactivity prediction, with evaluation across enzyme novelty, substrate novelty, and joint enzyme–substrate novelty settings.
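To make the three novelty settings concrete, the sketch below labels each test pair by whether its enzyme and its substrate also occur in the training set. It is an illustrative simplification, not the notebook's implementation: it treats novelty as exact identifier membership, whereas the thesis benchmark may define novelty through sequence or structure similarity thresholds. The function names, the "warm" label for fully seen pairs, and the enzyme and substrate identifiers are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# (enzyme_id, substrate_id); the identifiers used below are placeholders.
Pair = Tuple[str, str]


def novelty_regime(pair: Pair, train_enzymes: Set[str], train_substrates: Set[str]) -> str:
    """Classify one test pair by which of its components were seen in training."""
    enzyme, substrate = pair
    enzyme_seen = enzyme in train_enzymes
    substrate_seen = substrate in train_substrates
    if enzyme_seen and substrate_seen:
        return "warm"            # both components seen (label is an assumption)
    if substrate_seen:
        return "enzyme-cold"     # novel enzyme, known substrate
    if enzyme_seen:
        return "substrate-cold"  # known enzyme, novel substrate
    return "double-cold"         # both enzyme and substrate are novel


def label_test_pairs(train_pairs: Iterable[Pair], test_pairs: Iterable[Pair]) -> List[Tuple[Pair, str]]:
    """Attach a novelty regime to every test pair, given the training pairs."""
    train_pairs = list(train_pairs)
    train_enzymes = {enzyme for enzyme, _ in train_pairs}
    train_substrates = {substrate for _, substrate in train_pairs}
    return [(pair, novelty_regime(pair, train_enzymes, train_substrates)) for pair in test_pairs]


# Toy data with made-up identifiers.
train = [("UGT_A", "quercetin"), ("UGT_B", "kaempferol")]
test = [
    ("UGT_A", "kaempferol"),
    ("UGT_C", "quercetin"),
    ("UGT_A", "naringenin"),
    ("UGT_C", "naringenin"),
]
labels = label_test_pairs(train, test)
for pair, regime in labels:
    print(pair, regime)
```

Reporting metrics separately for each regime, rather than pooling all test pairs, is what lets a benchmark distinguish a model that interpolates among known enzymes and substrates from one that generalizes to new ones.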