This is the CAFA 5 dataset of 142k protein sequences annotated with their gene ontology (GO) terms. The samples are divided into three subsets each containing a set of GO terms that are associated with one of the three subgraphs of the gene ontology - Molecular Function, Biological Process, and Cellular Component. In addition, we provide a stratified train/test split that utilizes term embeddings to distribute term labels equally. The term embeddings are included in the dataset and can be used to stratify custom splits or to search for sequences with similar gene ontologies.
The code to export this dataset can be found here.
The CAFA 5 dataset is available on HuggingFace Hub and can be loaded using the HuggingFace Datasets library.
The dataset is divided into three subsets according to the GO terms that the sequences are annotated with.
all- All annotationsmf- Only molecular function termscc- Only celluar component termsbp- Only biological process terms
To load the default CAFA 5 dataset with all function annotations you can use the example below.
from datasets import load_dataset
dataset = load_dataset("andrewdalpino/CAFA5")To load a subset of the CAFA 5 dataset use the example below.
dataset = load_dataset("andrewdalpino/CAFA5", "mf")We provide a 90/10 train and test split for your convenience. The subsets were determined using a stratified approach which assigns cluster numbers to sequences based on their terms embeddings. We've included the stratum IDs so that you can generate additional custom stratified splits as shown in the example below.
from datasets import load_dataset
dataset = load_dataset("andrewdalpino/CAFA5", split="train")
dataset = dataset.class_encode_column("stratum_id")
dataset = dataset.train_test_split(test_size=0.2, stratify_by_column="stratum_id")You can also filter the samples of the dataset like in the example below.
dataset = dataset.filter(lambda sample: sample["length"] <= 2048)Some tasks may require you to tokenize the amino acid sequences. In this example, we loop through the samples and add a tokens column to store the tokenized sequences.
def tokenize(sample: dict): list[int]:
tokens = tokenizer.tokenize(sample["sequence"])
sample["tokens"] = tokens
return sample
dataset = dataset.map(tokenize, remove_columns="sequence")Iddo Friedberg, Predrag Radivojac, Clara De Paolis, Damiano Piovesan, Parnal Joshi, Walter Reade, and Addison Howard. CAFA 5 Protein Function Prediction. https://kaggle.com/competitions/cafa-5-protein-function-prediction, 2023. Kaggle.