Rapid & standardized annotation of bacterial genomes, MAGs & plasmids using protein structural information
baktfold is a sensitive annotation tool for bacterial genomes, MAGs & plasmids genomes using protein structural homology.
baktfold is very similar to phold but goes beyond phages to bacterial annotation. baktfold takes all "hypothetical proteins" from Bakta's output and uses the ProstT5 protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by Foldseek. Foldseek is then used to search these against a series of databases (SwissProt, AlphaFold Database non-singleton clusters, PDB and CATH).
Additionally, instead of using ProstT5, you can specify protein structures that you have pre-computed for your hypothetical proteins.
You can also specify custom databases to search against using --custom-db
Baktfold is currently under active development. We would welcome any and all feedback (especially bugs) via Issues
If you don't want to install baktfold locally, you can run it without any code using the Google Colab notebook
- The best way to install
baktfoldis using conda, as this will install Foldseek (the only non-Python dependency) along with the Python dependencies - We would highly recommend installing conda via miniforge.
- To install baktfold:
conda create -n baktfoldENV -c conda-forge -c bioconda baktfold - To utilise phold with GPU, a GPU compatible version of pytorch must be installed. By default conda will usually install a CPU-only version.
- If you have an NVIDIA GPU, please try:
conda create -n baktfoldENV -c conda-forge -c bioconda baktfold pytorch=*=cuda*- If you have a Mac with M-series Apple Silicon, you may need to install a particular version of Pytorch to utilise GPU-acceleration
- The same is true if you use other non-NVIDIA e.g. AMD GPUs
- See this link for some more detail and further links
- You can also install baktfold using pip.
pip install baktfold- You will need to have Foldseek (ideally v10.941cd33) installed and available in the $PATH.
- You can install the latest version of baktfold with potentially untested and unreleased changes into a conda environment using conda as follows:
conda create -n baktfoldENV foldseek
conda activate baktfoldENV
git clone https://github.com/gbouras13/baktfold.git
cd baktfold
pip install .
baktfold --help- To download and install baktfold's databases (use as many threads with
-tas you can to speed up downloading)
baktfold install -d baktfold_db -t 8- If you have an NVIDIA GPU, you will need to format the database to allow it to use Foldseek-GPU with
--foldseek-gpu- Note: you can do this after downloading the database with the above command (it won't redownload the database, only do the relevant Foldseek database padding)
baktfold install -d baktfold_db --foldseek-gpu- You will first need to run bakta and use the resulting
.jsonfile as input forbaktfold- We will add other input formats eventually, put we would always recommend running Bakta first, as it is awesome and comprehensive
- To use
baktfold runusing a dummy test example
# with nvidia gpu
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json -o baktfold_output -f -t 8 -d baktfold_db --foldseek-gpu
# without nvidia gpu available
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json -o baktfold_output -f -t 8 -d baktfold_db - To use
baktfold proteinsusing a dummy test example protein.faafile - Note that this can be any
.faa(It does not have to be the output of Bakta)
# with nvidia gpu
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa -o baktfold_proteins_output -f -t 8 -d baktfold_db --foldseek-gpu
# without nvidia gpu available
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa -o baktfold_proteins_output -f -t 8 -d baktfold_db -
The two most useful commands are
baktfold runandbaktfold proteins -
baktfold runaccepts a Bakta json file as its input, and by default, it will annotate all hypothetical CDS and return a variety of Bakta-like compliant output formats. All other annotations will be inherited from the Bakta output -
baktfold proteinsaccepts a protein FASTA.faaformat file as input. It will annotate all protein sequences and return a variety ofbakta_proteins-like output formats -
baktfold predictandbaktfold comparesplitbaktfold runinto the ProstT5 and Foldseek modules, whilebaktfold proteins-predictandbaktfold proteins-comparedo the same forbaktfold proteins(useful if you have non-NVIDIA GPUs) -
It is recommend you run baktfold with a GPU if you can.
-
If you do not have a GPU, baktfold will still run, but the ProstT5 step will be fairly slow.
-
If you have a NVIDIA GPU, you can also use the
--foldseek-gpuparameter to accelerate Foldseek further
Usage: baktfold [OPTIONS] COMMAND [ARGS]...
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
Commands:
citation Print the citation(s) for this tool
compare Runs Foldseek vs baktfold db
install Installs ProstT5 model and baktfold database
predict Uses ProstT5 to predict 3Di tokens - GPU recommended
proteins baktfold protein-predict then comapare all in one - GPU...
proteins-compare Runs Foldseek vs baktfold db on proteins input
proteins-predict Runs ProstT5 on a multiFASTA input - GPU recommended
run baktfold predict then comapare all in one - GPU...Usage: baktfold run [OPTIONS]
baktfold predict then comapare all in one - GPU recommended
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Bakta Genbank format or
Bakta JSON format [required]
-o, --output PATH Output directory [default: output_baktfold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: baktfold]
-d, --database TEXT Specific path to installed baktfold database
-f, --force Force overwrites the output directory
--batch-size INTEGER batch size for ProstT5. 1 is usually fastest.
[default: 1]
--cpu Use cpus only.
--omit-probs Do not output per residue 3Di probabilities
from ProstT5. Mean per protein 3Di
probabilities will always be output.
--save-per-residue-embeddings Save the ProstT5 embeddings per resuide in a
h5 file
--save-per-protein-embeddings Save the ProstT5 embeddings as means per
protein in a h5 file
--mask-threshold FLOAT Masks 3Di residues below this value of
ProstT5 confidence for Foldseek searches
[default: 25]
-e, --evalue FLOAT Evalue threshold for Foldseek [default:
1e-3]
-s, --sensitivity FLOAT Sensitivity parameter for foldseek [default:
9.5]
--keep-tmp-files Keep temporary intermediate files,
particularly the large foldseek_results.tsv
of all Foldseek hits
--max-seqs INTEGER Maximum results per query sequence allowed to
pass the prefilter. You may want to reduce
this to save disk space for enormous datasets
[default: 1000]
--ultra-sensitive Runs baktfold with maximum sensitivity by
skipping Foldseek prefilter. Not recommended
for large datasets.
--extra-foldseek-params TEXT Extra foldseek search params
--custom-db TEXT Path to custom database
--foldseek-gpu Use this to enable compatibility with
Foldseek-GPU search acceleration
--custom-annotations PATH Custom Foldseek DB annotations, 2 column tsv.
Column 1 matches the Foldseek headers, column
2 is the description.
-a, --all-proteins annotate all proteins (not just
hypotheticals)- The majority of outputs match bakta.
- Specifically, all the format compliant outputs match bakta's.
- The differences are:
<prefix>.inference.tsvis changed compared to bakta. Inbaktfold, this file gives a quick overview of the differentbaktfolddatabases the query protein has hit (if any)- For example:
ID Length Product Swissprot AFDBClusters PDB CATH
MEGJMNBEGN_27 162 HTH-type quorum-sensing regulator RhlR swissprot_P54292 afdbclusters_A0A9E1VSB0 pdb_5l09 cath_3sztB01
MEGJMNBEGN_30 68 hypothetical protein
MEGJMNBEGN_70 94 hypothetical protein afdbclusters_A0A1I3V7E0 * `<prefix>_<database>_tophit.tsv` files give the detailed Foldseek alignment information for each tophit found for each database.
* For example:
query target bitscore fident evalue qStart qEnd qLen qCov tStart tEnd tLen tCov
MEGJMN_070 AF-A0A1I3V7E0-F1-model_v6 292 0.41 2.619e-06 1 91 93 0.97 1 95 99 0.95* The full Foldseek search outputs are not kept by default (only tophits) - you can keep the full Foldseek search tsvs using `--keep-tmp-files`. They will be called `foldseek_results_<database>.tsv`
* `baktfold_3di.fasta` which gives the 3Di tokens for each input CDS
* `baktfold_prostT5_3di_mean_probabilities.csv` and `baktfold_prostT5_3di_all_probabilities.json`, which give some score of the confidence ProstT5 has in its predictions. You can disable this output with `--omit-probs`
* Baktfold does not have plotting functionality like Bakta (yet)
- As Baktfold inherits annotations from Bakta, please see the explanation in bakta for all other concepts here
- Baktfold adds one conceptual term in addition to Bakta's:
- PSTC: protein structure clusters. These comprise of structure-based annotations to any of Baktfold's databases
-
A manuscript describing
baktfoldis in preparation. -
Please be sure to cite the following core dependencies - citing all bioinformatics tools that you use helps us, so helps you get better bioinformatics tools:
-
Foldseek - (https://github.com/steineggerlab/foldseek) van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist C, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology (2023), doi:10.1038/s41587-023-01773-0
-
ProstT5 - (https://github.com/mheinzinger/ProstT5) Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steinegger, Burkhard Rost. ProstT5: Bilingual language model for protein sequence and structure. NAR Genomics and Bioinformatics (2024) doi:10.1101/2023.07.23.550085
-
Please also consider citing these databases where relevant:
-
AFDB/SwissProt - Mihaly Varadi, Damian Bertoni, Paulyna Magana, Urmila Paramval, Ivanna Pidruchna, Malarvizhi Radhakrishnan, Maxim Tsenkov, Sreenath Nair, Milot Mirdita, Jingi Yeo, Oleg Kovalevskiy, Kathryn Tunyasuvunakool, Agata Laydon, Augustin Žídek, Hamish Tomlinson, Dhavanthi Hariharan, Josh Abrahamson, Tim Green, John Jumper, Ewan Birney, Martin Steinegger, Demis Hassabis, Sameer Velankar, AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences, Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D368–D375, https://doi.org/10.1093/nar/gkad1011
-
CATH - Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH--a hierarchic classification of protein domain structures. Structure. 1997 Aug 15;5(8):1093-108. doi: 10.1016/s0969-2126(97)00260-8. PMID: 9309224.
-
PDB - H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne, The Protein Data Bank (2000) Nucleic Acids Research 28: 235-242 https://doi.org/10.1093/nar/28.1.235