atfrank/cafa-6-protein-function-prediction

CAFA-6 Protein Function Prediction

This repository contains code and analysis for the CAFA-6 (Critical Assessment of Function Annotation) Protein Function Prediction competition on Kaggle.

Competition Overview

CAFA-6 challenges participants to predict protein function from amino acid sequence, with functions expressed as Gene Ontology (GO) and Human Phenotype Ontology (HPO) terms. The goal is to develop computational methods that accurately predict the biological functions of proteins that have not yet been characterized experimentally.

Project Structure

cafa-6-protein-function-prediction/
├── data/
│   ├── raw/           # Original competition data
│   ├── processed/     # Preprocessed features and datasets
│   └── external/      # External datasets (UniProt, GO annotations, etc.)
├── docs/
│   └── literature/    # Research papers and documentation
├── notebooks/         # Jupyter notebooks for EDA and experiments
├── src/
│   ├── models/        # Model architectures
│   ├── features/      # Feature engineering
│   ├── utils/         # Helper functions
│   └── visualization/ # Plotting utilities
├── submissions/       # Competition submissions
├── models/            # Saved models
└── logs/              # Training logs

Setup

Prerequisites

  • Python 3.10+
  • CUDA-capable GPU (recommended)
  • Anaconda or Miniconda

Installation

  1. Clone this repository:
git clone <repository-url>
cd cafa-6-protein-function-prediction
  2. Run the setup script:
./setup.sh

Or manually:

# Create conda environment
conda env create -f environment.yml

# Activate environment
conda activate cafa6

# Install additional dependencies
pip install -r requirements.txt

Key Dependencies

  • Deep Learning: PyTorch, TensorFlow, Transformers
  • Protein Analysis: BioPython, Fair-ESM, ProteinBERT
  • Graph Processing: NetworkX, PyTorch Geometric, DGL
  • GO Analysis: GOATOOLS, Pronto
  • ML Tools: Scikit-learn, Optuna, W&B, MLflow

Approach

1. Baseline Methods

  • BLAST-based similarity search
  • Feature engineering (amino acid composition, physicochemical properties)
  • Traditional ML models (Random Forest, XGBoost)
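As an illustration of the feature-engineering step, here is a minimal sketch of an amino acid composition feature. The function name `aa_composition` and the choice to ignore non-standard residue codes are assumptions made for this example, not necessarily what the code in src/features/ does.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Hypothetical helper: ambiguous or non-standard codes (X, B, Z, U, ...)
    are ignored, and fractions are normalised over standard residues only.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty sequence or no standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    features = aa_composition("MKTAYIAKQR")
    print(dict(zip(AMINO_ACIDS, features)))
```

The resulting 20-dimensional vector can be fed directly to a tree-based model such as Random Forest or XGBoost.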

2. Advanced Methods

  • Protein Language Models (ESM, ProtBERT)
  • Graph Neural Networks for GO hierarchy
  • Multi-task learning for GO categories (MF, BP, CC)
  • Transformer architectures
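Whatever model produces the scores, predictions are usually expected to respect the GO hierarchy: a protein annotated with a term is implicitly annotated with all of that term's ancestors. The sketch below is not a graph neural network. It is a simple post-processing step, shown here under assumptions, that raises each ancestor's score to at least the score of its highest-scoring descendant. The function name `propagate_scores`, the `parents` mapping, and the term IDs are hypothetical. In practice the parent relations would be read from the GO ontology file.

```python
def propagate_scores(
    scores: dict[str, float],
    parents: dict[str, list[str]],
) -> dict[str, float]:
    """Propagate term scores upward through a GO-like DAG.

    `scores` maps term ID -> predicted score for one protein.
    `parents` maps term ID -> list of direct parent term IDs (assumed acyclic).
    Every ancestor ends up with at least the maximum score of its descendants.
    """
    out = dict(scores)
    stack = list(scores.items())
    while stack:
        term, score = stack.pop()
        for parent in parents.get(term, []):
            # Only revisit a parent when its score actually increases.
            if out.get(parent, 0.0) < score:
                out[parent] = score
                stack.append((parent, score))
    return out


if __name__ == "__main__":
    # Toy hierarchy with made-up IDs: C -> B -> A, and D -> A.
    toy_parents = {"GO:C": ["GO:B"], "GO:B": ["GO:A"], "GO:D": ["GO:A"]}
    toy_scores = {"GO:C": 0.8, "GO:D": 0.3, "GO:A": 0.1}
    print(propagate_scores(toy_scores, toy_parents))
```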

3. Evaluation

  • Fmax score (the maximum, over decision thresholds, of the harmonic mean of precision and recall)
  • Ontology-aware metrics
  • Cross-validation with temporal splits

Getting Started

  1. Download competition data from Kaggle
  2. Explore the data using notebooks in notebooks/
  3. Run baseline models from src/models/
  4. Experiment with advanced architectures

References

See docs/literature/ for key CAFA papers and methodologies.

License

This project is for the Kaggle CAFA-6 competition. Please refer to competition rules for usage restrictions.
