🧬 VenusMutHub + TabPFN: Protein Mutation Effect Prediction with Foundation Models

This project investigates how transformer-based foundation models can predict the functional impact of protein mutations using zero-shot tabular learning. We apply TabPFN and its ensembling extensions to the VenusMutHub dataset, a deep mutational scanning benchmark focused on the Venus fluorescent protein.

📌 Why This Matters

Protein mutation effect prediction is central to protein engineering, synthetic biology, and disease variant interpretation. Traditional models require tuning and large training sets, but TabPFN:

Learns from millions of simulated tasks
Makes zero-shot predictions in one pass
Excels on small, noisy biological datasets

📚 Dataset: VenusMutHub (Hugging Face Datasets)

The VenusMutHub dataset provides deep mutational scanning data for the Venus fluorescent protein. Each entry represents a mutation, its altered sequence, and a fitness score corresponding to a biochemical property.

Each row includes:

Protein mutation descriptors (e.g. position, amino acid change)
A numerical fitness_score target

How to Access the Dataset

from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("AI4Protein/VenusMutHub")

➡️ Read Hugging Face Datasets documentation

🎯 Task & Evaluation

Goal: Predict continuous mutation effects from protein descriptors
Input Features: Encoded amino acid mutations (position, substitution)
Target: fitness_score (e.g. fluorescence or stability)
Model: TabPFNRegressor + AutoTabPFNRegressor
Evaluation Metrics:
- R² Score
- Mean Squared Error (MSE)

🧠 Model: TabPFN

TabPFN is a transformer-based foundation model trained on millions of tabular tasks to support zero-shot predictions without hyperparameter tuning.

Key Tools

TabPFNRegressor: Predicts continuous labels on tabular data
AutoTabPFNRegressor: From TabPFN Extensions; uses post-hoc ensembling for improved accuracy

🧪 Results

After training on ~70% of the dataset and testing on the remaining 30%, we obtained the following:

Model	Mean Squared Error	R² Score
TabPFNRegressor	0.0294	0.7558
AutoTabPFNRegressor	0.0293	0.7570

These results reflect strong out-of-the-box generalization for mutation effect prediction without domain-specific tuning.

🛠️ Setup & Installation

1. Clone this repository and create a virtual environment

git clone https://github.com/chewyuenrachael/tabpfn-venusmuthub.git
cd tabpfn-venusmuthub
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install dependencies

pip install -r requirements.txt

Contents of requirements.txt:

tabpfn
datasets
pandas
scikit-learn
matplotlib

3. Install TabPFN Extensions (Optional, for ensembling)

git clone https://github.com/priorlabs/tabpfn-extensions.git
pip install -e tabpfn-extensions

🧪 Model Usage

from tabpfn import TabPFNRegressor
from tabpfn_extensions.post_hoc_ensembles.sklearn_interface import AutoTabPFNRegressor

# Initialize and train
model = AutoTabPFNRegressor(device="cuda", max_time=60)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

📈 Results Summary

TabPFN demonstrates strong predictive performance with:

Minimal preprocessing
No need for hyperparameter tuning
Support for small/noisy biological datasets

💡 Why Use TabPFN?

🧠 Zero-shot predictions with no model selection
⚡ Fast inference (seconds on GPU)
🧬 Tailored for small-data bioinformatics problems
🔍 SHAP-based interpretability extensions available

🧠 Future Directions

Add multi-target regression for correlated protein properties
Integrate PLM embeddings from models like ESM or ProtT5
Explore SHAP and feature attribution from tabpfn-extensions
Use unsupervised module for outlier detection in mutation sets

📜 Attributions

Dataset

AI4Protein. VenusMutHub: A Benchmark for Protein Mutation Effect Prediction
➡️ Hugging Face Dataset
📜 Licensed under the MIT License

Model

TabPFN by Prior Labs
Hollmann et al. (2025). Accurate predictions on small data with a tabular foundation model. Nature.
DOI: 10.1038/s41586-024-08328-6

TabPFN Extensions:
https://github.com/priorlabs/tabpfn-extensions
Apache 2.0 License

🙌 Acknowledgements

Built using tools from:

🤗 Hugging Face Datasets
🔬 Prior Labs' TabPFN ecosystem
🔥 PyTorch
📊 scikit-learn & matplotlib

Special thanks to the creators of VenusMutHub and TabPFN for providing robust tools for protein mutation analysis in modern machine learning pipelines.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
tabpfn-extensions		tabpfn-extensions
.gitignore		.gitignore
README.md		README.md
TabPFN_VenusMutHub.ipynb		TabPFN_VenusMutHub.ipynb
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🧬 VenusMutHub + TabPFN: Protein Mutation Effect Prediction with Foundation Models

📌 Why This Matters

📚 Dataset: VenusMutHub (Hugging Face Datasets)

How to Access the Dataset

🎯 Task & Evaluation

🧠 Model: TabPFN

Key Tools

🧪 Results

🛠️ Setup & Installation

1. Clone this repository and create a virtual environment

2. Install dependencies

3. Install TabPFN Extensions (Optional, for ensembling)

🧪 Model Usage

📈 Results Summary

💡 Why Use TabPFN?

🧠 Future Directions

📜 Attributions

Dataset

Model

🙌 Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Languages

chewyuenrachael/tabpfn-venusmuthub

Folders and files

Latest commit

History

Repository files navigation

🧬 VenusMutHub + TabPFN: Protein Mutation Effect Prediction with Foundation Models

📌 Why This Matters

📚 Dataset: VenusMutHub (Hugging Face Datasets)

How to Access the Dataset

🎯 Task & Evaluation

🧠 Model: TabPFN

Key Tools

🧪 Results

🛠️ Setup & Installation

1. Clone this repository and create a virtual environment

2. Install dependencies

3. Install TabPFN Extensions (Optional, for ensembling)

🧪 Model Usage

📈 Results Summary

💡 Why Use TabPFN?

🧠 Future Directions

📜 Attributions

Dataset

Model

🙌 Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages