This repository contains code and analysis for the CAFA-6 (Critical Assessment of Function Annotation) Protein Function Prediction competition on Kaggle.
CAFA-6 is a competition focused on predicting protein functions from sequences using Gene Ontology (GO) and Human Phenotype Ontology (HPO) terms. The goal is to develop computational methods that can accurately predict the biological functions of proteins whose functions have not yet been experimentally determined.
cafa-6-protein-function-prediction/
├── data/
│ ├── raw/ # Original competition data
│ ├── processed/ # Preprocessed features and datasets
│ └── external/ # External datasets (UniProt, GO annotations, etc.)
├── docs/
│ └── literature/ # Research papers and documentation
├── notebooks/ # Jupyter notebooks for EDA and experiments
├── src/
│ ├── models/ # Model architectures
│ ├── features/ # Feature engineering
│ ├── utils/ # Helper functions
│ └── visualization/ # Plotting utilities
├── submissions/ # Competition submissions
├── models/ # Saved models
└── logs/ # Training logs
- Python 3.10+
- CUDA-capable GPU (recommended)
- Anaconda or Miniconda
- Clone this repository:
git clone <repository-url>
cd cafa-6-protein-function-prediction
- Run the setup script:
./setup.sh
Or manually:
# Create conda environment
conda env create -f environment.yml
# Activate environment
conda activate cafa6
# Install additional dependencies
pip install -r requirements.txt
- Deep Learning: PyTorch, TensorFlow, Transformers
- Protein Analysis: BioPython, Fair-ESM, ProteinBERT
- Graph Processing: NetworkX, PyTorch Geometric, DGL
- GO Analysis: GOAtools, Pronto
- ML Tools: Scikit-learn, Optuna, W&B, MLflow
- BLAST-based similarity search
- Feature engineering (amino acid composition, physicochemical properties)
- Traditional ML models (Random Forest, XGBoost)
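Amino acid composition, one of the baseline features listed above, is the fraction of each of the 20 standard residues in a sequence. The following is a minimal sketch, not this repository's implementation; the function name `amino_acid_composition` and the choice to ignore non-standard residues (such as `X` or `U`) are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list of residue fractions for one protein sequence.

    Assumption: residues outside the 20 standard letters are ignored, so the
    fractions are computed over standard residues only.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty sequence or no standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    features = amino_acid_composition("MKTAYIAKQR")
    print(dict(zip(AMINO_ACIDS, features)))
```

The resulting fixed-length vector could be fed to a Random Forest or XGBoost model as one feature block among others.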
- Protein Language Models (ESM, ProtBERT)
- Graph Neural Networks for GO hierarchy
- Multi-task learning for GO categories (MF, BP, CC)
- Transformer architectures
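Whatever architecture produces the scores, GO predictions are usually made consistent with the hierarchy: under the true-path rule, a protein annotated with a term is implicitly annotated with all of that term's ancestors. The sketch below shows one common way to enforce this on predicted scores, by raising each ancestor's score to at least the maximum of its descendants. The `parents` mapping and the term names are hypothetical toy data, and this is not necessarily how the models in this repository handle the hierarchy.

```python
def propagate_scores(scores, parents):
    """Push each term's score up to all of its ancestors (max rule).

    scores:  {term: score} for one protein
    parents: {term: [direct parent terms]} (hypothetical toy ontology)
    """
    propagated = dict(scores)
    for term, score in scores.items():
        stack = list(parents.get(term, []))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            # An ancestor must score at least as high as any descendant.
            if propagated.get(ancestor, 0.0) < score:
                propagated[ancestor] = score
            stack.extend(parents.get(ancestor, []))
    return propagated


if __name__ == "__main__":
    # Toy three-level hierarchy: T_child -> T_mid -> T_root
    toy_parents = {"T_child": ["T_mid"], "T_mid": ["T_root"], "T_root": []}
    print(propagate_scores({"T_child": 0.8, "T_mid": 0.3}, toy_parents))
```

In practice the parent relations would come from the GO file parsed with a library such as GOAtools or Pronto rather than a hand-written dictionary.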
- Fmax score (the maximum, over decision thresholds, of the harmonic mean of precision and recall)
- Ontology-aware metrics
- Cross-validation with temporal splits
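To make the Fmax metric listed above concrete, here is a minimal sketch of the unweighted, protein-centric form used in CAFA evaluations: at each threshold, precision is averaged over proteins with at least one prediction at or above the threshold, recall is averaged over all benchmark proteins, and Fmax is the best resulting F-measure. The function name `fmax`, the input layout, and the threshold grid are assumptions for illustration; the official competition scorer may differ (for example, by weighting terms).

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax.

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: set of true terms}
    Proteins with an empty truth set are left out of the benchmark.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # 0.01, 0.02, ..., 1.00
    benchmark = {p: terms for p, terms in truth.items() if terms}
    if not benchmark:
        return 0.0

    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in benchmark.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with a prediction at this t.
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(benchmark)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


if __name__ == "__main__":
    toy_truth = {"P1": {"T_a", "T_b"}, "P2": {"T_c"}}
    toy_pred = {"P1": {"T_a": 0.9, "T_x": 0.9}, "P2": {}}
    print(fmax(toy_pred, toy_truth))
```

Note that a method which predicts for only a few proteins can keep precision high while recall suffers, because recall is always averaged over the full benchmark.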
- Download competition data from Kaggle
- Explore the data using notebooks in notebooks/
- Run baseline models from src/models/
- Experiment with advanced architectures
See docs/literature/ for key CAFA papers and methodologies.
This project is for the Kaggle CAFA-6 competition. Please refer to competition rules for usage restrictions.