This project investigates how transformer-based foundation models can predict the functional impact of protein mutations using zero-shot tabular learning. We apply TabPFN and its ensembling extensions to the VenusMutHub dataset, a deep mutational scanning benchmark focused on the Venus fluorescent protein.
Protein mutation effect prediction is central to protein engineering, synthetic biology, and disease variant interpretation. Traditional models require tuning and large training sets, but TabPFN:
- Learns from millions of simulated tasks
- Makes zero-shot predictions in one pass
- Excels on small, noisy biological datasets
The VenusMutHub dataset provides deep mutational scanning data for the Venus fluorescent protein. Each entry represents a mutation, its altered sequence, and a fitness score corresponding to a biochemical property.
Each row includes:
- Protein mutation descriptors (e.g. position, amino acid change)
- A numerical `fitness_score` target
The dataset can be loaded with the Hugging Face `datasets` library:

    from datasets import load_dataset

    # Log in first (e.g. with `huggingface-cli login`) to access this dataset
    ds = load_dataset("AI4Protein/VenusMutHub")

See the Hugging Face Datasets documentation for details.
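The position and substitution descriptors have to become numeric columns before a tabular model can use them. The sketch below shows one way to do that, assuming each mutation is written as a string such as `A206K` (wild-type residue, 1-based position, mutant residue). The dataset's actual column names and notation are not shown in this README, so `encode_mutation`, `AMINO_ACIDS`, and the feature names are hypothetical.

```python
import re

# The 20 standard amino acids in a fixed order, so indices are reproducible.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Assumed notation: wild-type residue, 1-based position, mutant residue.
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def encode_mutation(mutation: str) -> dict:
    """Turn one substitution string into a flat dict of numeric features."""
    match = MUTATION_PATTERN.match(mutation.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognised mutation string: {mutation!r}")
    wild_type, position, mutant = match.groups()
    if wild_type not in AA_INDEX or mutant not in AA_INDEX:
        raise ValueError(f"Non-standard amino acid in: {mutation!r}")
    return {
        "position": int(position),
        "wt_index": AA_INDEX[wild_type],
        "mut_index": AA_INDEX[mutant],
    }


# Example: encode two illustrative substitutions into feature rows.
rows = [encode_mutation(m) for m in ["A206K", "F46L"]]
print(rows)
```

The integer residue codes are category labels, not ordered quantities, so it may help to mark them as categorical features or one-hot encode them before fitting.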
- Goal: Predict continuous mutation effects from protein descriptors
- Input Features: Encoded amino acid mutations (position, substitution)
- Target: `fitness_score` (e.g. fluorescence or stability)
- Model: `TabPFNRegressor` + `AutoTabPFNRegressor`
- Evaluation Metrics:
  - R² Score
  - Mean Squared Error (MSE)
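For reference, the two metrics can be written out in a few lines. This is a plain-Python sketch of the standard definitions for list inputs, not the project's evaluation code; scikit-learn's `mean_squared_error` and `r2_score` compute the same quantities.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of target variance the model explains."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Toy values, not project data: one of four predictions is off by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mean_squared_error(y_true, y_pred))  # 0.25
print(r2_score(y_true, y_pred))            # 0.8
```

An R² of 1 means perfect predictions, 0 means no better than always predicting the mean target, and negative values mean worse than that baseline.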
TabPFN is a transformer-based foundation model trained on millions of tabular tasks to support zero-shot predictions without hyperparameter tuning.
- `TabPFNRegressor`: Predicts continuous labels on tabular data
- `AutoTabPFNRegressor`: From TabPFN Extensions; uses post-hoc ensembling for improved accuracy
After training on ~70% of the dataset and testing on the remaining 30%, we obtained the following:
| Model | Mean Squared Error | R² Score |
|---|---|---|
| TabPFNRegressor | 0.0294 | 0.7558 |
| AutoTabPFNRegressor | 0.0293 | 0.7570 |
Both models explain roughly 76% of the variance in held-out fitness scores without domain-specific tuning. Post-hoc ensembling changes the results only marginally (MSE 0.0293 vs. 0.0294, R² 0.7570 vs. 0.7558).
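This README does not say how the 70/30 split was drawn. The sketch below assumes a random shuffle with a fixed seed; `train_test_split_indices` is a hypothetical helper. In practice, scikit-learn's `train_test_split(X, y, test_size=0.3, random_state=...)` does the same job.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.3, seed=0):
    """Shuffle row indices and cut them into (train, test) index lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# Example: 10 rows give 7 training rows and 3 test rows.
train_idx, test_idx = train_test_split_indices(n_rows=10)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the reported metrics repeatable. With a dataset this small, scores can shift noticeably between seeds, so averaging over several splits is a reasonable check.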
To set up the project locally:

    git clone https://github.com/chewyuenrachael/tabpfn-venusmuthub.git
    cd tabpfn-venusmuthub
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt

Contents of `requirements.txt`:

    tabpfn
    datasets
    pandas
    scikit-learn
    matplotlib
Install TabPFN Extensions from source:

    git clone https://github.com/priorlabs/tabpfn-extensions.git
    pip install -e tabpfn-extensions

Example usage:

    from tabpfn import TabPFNRegressor
    from tabpfn_extensions.post_hoc_ensembles.sklearn_interface import AutoTabPFNRegressor

    # Initialize and train
    model = AutoTabPFNRegressor(device="cuda", max_time=60)
    model.fit(X_train, y_train)

    # Predict
    predictions = model.predict(X_test)

TabPFN demonstrates strong predictive performance with:
- Minimal preprocessing
- No need for hyperparameter tuning
- Support for small/noisy biological datasets
- Zero-shot predictions with no model selection
- Fast inference (seconds on GPU)
- Tailored for small-data bioinformatics problems
- SHAP-based interpretability extensions available
Possible future directions:
- Add multi-target regression for correlated protein properties
- Integrate PLM embeddings from models like ESM or ProtT5
- Explore SHAP and feature attribution from `tabpfn-extensions`
- Use the `unsupervised` module for outlier detection in mutation sets
AI4Protein. VenusMutHub: A Benchmark for Protein Mutation Effect Prediction
Hugging Face Dataset: `AI4Protein/VenusMutHub`
Licensed under the MIT License
TabPFN by Prior Labs
Hollmann et al. (2025). Accurate predictions on small data with a tabular foundation model. Nature.
DOI: 10.1038/s41586-024-08328-6
TabPFN Extensions:
https://github.com/priorlabs/tabpfn-extensions
Apache 2.0 License
Built using tools from:
- Hugging Face Datasets
- Prior Labs' TabPFN ecosystem
- PyTorch
- scikit-learn & matplotlib
Special thanks to the creators of VenusMutHub and TabPFN for providing robust tools for protein mutation analysis in modern machine learning pipelines.