A modular reimplementation of AF-Cluster for plug-n-play functionality, incorporation into computational workflows, and HPC/HTC deployment through ColabFold.
- Modular Design: Clean separation of MSA generation, clustering, and structure prediction
- Slurm & Apptainer Compatibility: Integration for high-performance and high-throughput computing clusters
- Modular MSA Clustering: Additional clustering methods can be easily swapped in.
- Python 3.8+
- ColabFold (for structure prediction)
- MMseqs2 (optional, for local MSA generation)
- Clone the repository:

      git clone git@github.com:gelnesr/AFCluster-2.git
      cd AFCluster-2

- Set up ColabFold locally:
  If you do not already have ColabFold installed, please follow the installation instructions from localcolabfold:

      # For Linux
      wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_linux.sh
      bash install_colabbatch_linux.sh

  After setting up ColabFold, set the relative path in the configs/afcluster.yml file. We recommend also setting up a cache directory.

- For HPC deployment, run the setup script:
  This will automatically set up the environment and set the path for ColabFold to your $SCRATCH/tools folder. Edit the corresponding line in the .sh script to point to the appropriate directory.

      bash scripts/env_setup.sh

- For Apptainer deployment, run the following set of commands.
  First, build a .sif file in your $SCRATCH/containers folder and create a cache directory at $SCRATCH/cache. Edit the corresponding line in the .sh script to point to the appropriate directory.

      bash scripts/build_apptainer.sh

  Then run the setup script, which will automatically set up the environment and set the path for ColabFold to your $SCRATCH/tools folder. Edit the corresponding line in the .sh script to point to the appropriate directory.

      bash scripts/env_setup.sh

  Then run module load apptainer or module load singularity to make Apptainer available. Set INPUT.fasta in the command below before running:

      apptainer exec --nv \
        --bind "AFCluster-2:/w","$SCRATCH:$SCRATCH" \
        --env XDG_CACHE_HOME="$CACHE" \
        --env MPLCONFIGDIR="$CACHE" \
        "$IMG" bash -lc '
          cd /w
          source afc/bin/activate
          python afcluster.py --input INPUT.fasta
        '

Run the pipeline on a FASTA file:

    python afcluster.py --input sequences.fasta

- --input: Input FASTA file (required)
- --msa: Pre-computed MSA file (optional)

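The two options above could be declared roughly as follows. This is a minimal sketch using the standard library's argparse, not the actual contents of afcluster.py; the `build_parser` helper and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented --input and --msa options."""
    parser = argparse.ArgumentParser(
        prog="afcluster.py",
        description="Cluster an MSA and predict a structure per cluster.",
    )
    # --input is documented as required.
    parser.add_argument("--input", required=True, help="Input FASTA file")
    # --msa is documented as optional; None means no pre-computed MSA was given.
    parser.add_argument("--msa", default=None, help="Pre-computed MSA file")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the example self-contained.
args = build_parser().parse_args(["--input", "sequences.fasta"])
print(args.input, args.msa)
```

Leaving `--msa` as `None` by default gives the caller a simple way to decide whether an MSA still needs to be generated.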
Modify configs/afcluster.yml to adjust clustering parameters:
keyword: "MAIN"
gap_cutoff: 0.25
random_seed: 42
cluster_method: "dbscan"
dbscan:
  min_samples: 10
  eps_val: null
  min_eps: 3
  max_eps: 20.0
  eps_step: 0.5
path_vars:
  PATH: "/path/to/colabfold/bin:$PATH"
  XDG_CACHE_HOME: "/path/to/cache"
  MPLCONFIGDIR: "/path/to/cache"

- Input Processing: Load FASTA sequences
- MSA Generation: Create multiple sequence alignments using colabfold_batch or local MMseqs2
- Sequence Filtering: Remove sequences with high gap content
- Clustering: Group similar sequences using specified method
- Structure Prediction: Run ColabFold on each cluster
- Output: Generate cluster-specific A3M files and predicted structures with corresponding json/png files
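
Two of the steps above can be sketched in plain Python. The functions below are illustrative only and are not taken from this repository: `gap_fraction`, `filter_by_gaps`, and `eps_candidates` are hypothetical names. The sketch assumes that a sequence is kept when its gap fraction is strictly below gap_cutoff, that lowercase A3M insertion characters are ignored when counting columns, and that leaving eps_val as null means eps values from min_eps to max_eps (inclusive) are scanned in steps of eps_step.

```python
from typing import List, Tuple


def gap_fraction(aligned_seq: str) -> float:
    """Fraction of aligned columns that are gaps ('-')."""
    # Assumption: lowercase letters are A3M insertions and do not count as columns.
    columns = [c for c in aligned_seq if not c.islower()]
    if not columns:
        return 1.0
    return columns.count("-") / len(columns)


def filter_by_gaps(
    records: List[Tuple[str, str]], gap_cutoff: float = 0.25
) -> List[Tuple[str, str]]:
    """Keep (name, sequence) records whose gap fraction is below gap_cutoff."""
    return [(name, seq) for name, seq in records if gap_fraction(seq) < gap_cutoff]


def eps_candidates(min_eps: float, max_eps: float, eps_step: float) -> List[float]:
    """Candidate eps values from min_eps to max_eps, inclusive (assumed scan grid)."""
    # Use an integer step count to avoid accumulating floating-point error.
    n_steps = int(round((max_eps - min_eps) / eps_step))
    return [min_eps + i * eps_step for i in range(n_steps + 1)]


records = [
    ("query", "MKTAYIAK"),
    ("hit1", "MK-AYIAK"),
    ("hit2", "M---Y--K"),
]
kept = filter_by_gaps(records, gap_cutoff=0.25)
grid = eps_candidates(3, 20.0, 0.5)
print([name for name, _ in kept], len(grid))
```

Each candidate eps would then be handed to the configured clustering method; how the best value is chosen is not described here and is left out of the sketch.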
output/
├── sequence_id/
│   ├── sequence_id.a3m          # Generated MSA
│   ├── clusters/
│   │   ├── sequence_id_000.a3m  # Cluster 0
│   │   ├── sequence_id_001.a3m  # Cluster 1
│   │   └── ...
│   └── preds/
│       ├── sequence_id_000/
│       │   ├── s0/              # Structure prediction seed 0
│       │   ├── s1/              # Structure prediction seed 1
│       │   └── ...
│       └── ...
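
For downstream scripts that need to locate results, the layout above can be expressed as path helpers. This is a sketch under the assumption that cluster indices are zero-padded to three digits and seed folders are named s0, s1, and so on, as the tree suggests; `cluster_a3m_path` and `prediction_dir` are hypothetical helpers, not part of AFCluster-2.

```python
from pathlib import Path


def cluster_a3m_path(output_dir: str, sequence_id: str, cluster_idx: int) -> Path:
    """Path of the A3M file for one cluster, e.g. .../clusters/sequence_id_001.a3m."""
    return (
        Path(output_dir)
        / sequence_id
        / "clusters"
        / f"{sequence_id}_{cluster_idx:03d}.a3m"
    )


def prediction_dir(
    output_dir: str, sequence_id: str, cluster_idx: int, seed: int
) -> Path:
    """Directory holding predictions for one cluster and one seed."""
    return (
        Path(output_dir)
        / sequence_id
        / "preds"
        / f"{sequence_id}_{cluster_idx:03d}"
        / f"s{seed}"
    )


print(cluster_a3m_path("output", "sequence_id", 1).as_posix())
print(prediction_dir("output", "sequence_id", 0, 1).as_posix())
```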
This project is based on the original AF-Cluster implementation.
If you use AFCluster-2 in your research, please cite the following works and acknowledge this implementation:
@article{AFCluster,
  title={Predicting multiple conformations via sequence clustering and AlphaFold2},
  DOI={10.1038/s41586-023-06832-9},
  journal={Nature},
  author={Wayment-Steele, Hannah K. and Ojoawo, Adedolapo and Otten, Renee and Apitz, Julia M. and Pitsawong, Warintra and Hömberger, Marc and Ovchinnikov, Sergey and Colwell, Lucy and Kern, Dorothee},
  year={2023},
}

This software builds upon the following tools and methods:
AF-Cluster - Wayment-Steele, H.K., Ojoawo, A., Otten, R. et al. Predicting multiple conformations via sequence clustering and AlphaFold2. Nature 625, 832–839 (2024). https://doi.org/10.1038/s41586-023-06832-9
ColabFold - Mirdita, M., Schütze, K., Moriwaki, Y. et al. ColabFold: making protein folding accessible to all. Nat Methods 19, 679–682 (2022). https://doi.org/10.1038/s41592-022-01488-1
MMseqs2 - Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017). https://www.nature.com/articles/nbt.3988
AlphaFold - Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2