This repository is the official implementation of "Advancing protein evolution with inverse folding models integrating structural and evolutionary constraints".
AiCE is an approach that optimizes protein function by incorporating structural and evolutionary constraints into the process of AI-assisted mutation nomination. It is compatible with widely used protein inverse folding models such as ProteinMPNN, LigandMPNN, ESM-IF1, SaProt, and others. A demo for nominating high-fitness (HF) mutations using AiCE-ProteinMPNN is provided in this repository.
This method nominates mutations based on sampling inverse folding protein sequences. It works with common protein inverse folding models. A demonstration using AiCE-ProteinMPNN for monomeric proteins is provided in this repository. For mutation prediction in protein complexes, the script can be modified or customized accordingly, as the underlying principle is similar. Alternatively, the complex structure can be decomposed into multiple substructures, and mutations can be predicted for each separately. This strategy effectively reduces computational demands. For detailed methodology, please refer to our paper.
To run AiCE, you will need:
- Python: Version ≥ 3.8
- Libraries: PyTorch, NumPy, SciPy, and Pandas
- Biological Sequence Analysis: Biopython
- PDB File Handling: ProDy
Dependencies can be installed directly using the provided requirements.txt file.
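As a quick sanity check after installation, the following sketch reports which of the libraries listed above cannot be found in the active environment. The import names (`torch` for PyTorch, `Bio` for Biopython, `prody` for ProDy) and the helper names `missing_modules` and `python_is_supported` are illustrative assumptions, not part of the repository.

```python
import importlib.util
import sys

# Import names for the libraries listed above. "Bio" is Biopython's import
# name and "torch" is PyTorch's; the others match their package names.
REQUIRED_MODULES = ["torch", "numpy", "scipy", "pandas", "Bio", "prody"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 8)):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python >= 3.8:", python_is_supported())
    print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

An empty list from `missing_modules` only shows that the packages are discoverable; it does not verify their versions against requirements.txt.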
# Clone the AiCE repository
git clone https://github.com/ScorpioLea/AiCE
cd AiCE
conda create -n AiCE python=3.11
conda activate AiCE
pip3 install -r requirements.txt
The mkdssp program is required for secondary structure annotation in AiCE.
✅ Option 1: Use the pre-installed script (if available)
This repository includes a pre-downloaded mkdssp binary in the scripts/ directory. To make it executable, run:
sudo chmod 755 scripts/mkdssp
Note
The mkdssp executable provided in the AiCE/scripts/ directory may fail to run properly on certain systems due to compiler or library version mismatches.
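One way to find out whether a given mkdssp binary starts on your system is to launch it once and inspect the exit status. The sketch below does this with the standard library; the helper name `can_launch` is hypothetical, and the `--version` argument is an assumption that should be replaced with any harmless invocation your mkdssp build accepts.

```python
import subprocess
import sys


def can_launch(executable, args=("--version",), timeout=10):
    """Return True if `executable` starts and exits with status 0.

    The `--version` argument is an assumption; substitute any harmless
    invocation that your mkdssp build accepts.
    """
    try:
        result = subprocess.run(
            [executable, *args],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers a missing file, a missing execute bit, and a binary
        # built for another platform.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("mkdssp usable:", can_launch("scripts/mkdssp"))
```

A binary that fails to load its shared libraries typically exits with a non-zero status, so it is reported as unusable; in that case Option 2 or Option 3 is the safer route.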
✅ Option 2: Compile DSSP manually from source
git clone https://github.com/PDB-REDO/dssp.git
cd dssp
cmake -S . -B build
cmake --build build
cmake --install build
# Return to the AiCE root (assumes dssp was cloned inside it)
cd ..
rm -f scripts/mkdssp
ln -s "$(pwd)/dssp/build/mkdssp" scripts/mkdssp
chmod +x dssp/build/mkdssp
✅ Option 3: Install via Conda
conda install conda-forge::dssp
After installation, create a symbolic link from the conda-installed binary to the scripts/ directory:
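ln -sf ~/anaconda3/envs/AiCE/bin/mkdssp scripts/mkdssp
Plink is required for the linkage disequilibrium (LD) calculation. Download and unpack it into scripts/plink/:
wget -c https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20241022.zip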
unzip -d scripts/plink/ plink_linux_x86_64_20241022.zip
rm plink_linux_x86_64_20241022.zip
- Inverse folding models: An inverse folding model is required to generate structure-compatible sequences from a given protein structure. For demonstration purposes, we use ProteinMPNN, a lightweight model based on graph neural networks. A pre-deployed version of ProteinMPNN is provided in the scripts folder.
- Secondary structure annotation: The DSSP algorithm is used to assign the protein secondary structure from the input structure. The repository includes the mkdssp module (version 4.4.7).
- Linkage Disequilibrium (LD) calculation: Plink is used to calculate the LD score. We provide a deployment workflow for plink version v1.9.0-b.7.7. Note that plink v2.0 is not compatible with our workflow by default; you may need to modify scripts/02.caculated_ld.py to use plink v2.0 or later.
- Evolutionary coupling analysis (SCA): The repository contains a modified version of the pySCA module to calculate amino acid evolutionary coupling effects.
A demo notebook (AiCE_demo.ipynb) is provided for a simple demonstration.
Before running this demo, you need to convert your structure file to .pdb format if it is currently in .cif format. We provide a conversion tool at scripts/cif2pdb.py. You can run the following command to see more details:
python scripts/cif2pdb.py --help
Change to the example directory to get started:
cd example/
The scripts in this repository use relative paths; you may modify them according to your specific requirements.
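As a pre-flight check for the format conversion mentioned earlier, the sketch below lists the .cif files in a folder that have no .pdb file with the same name stem, i.e. the candidates for scripts/cif2pdb.py. The helper name `cif_files_without_pdb` is hypothetical, and matching by file stem is an assumption about how converted files are named.

```python
from pathlib import Path


def cif_files_without_pdb(folder):
    """List .cif files in `folder` that have no .pdb file with the same stem."""
    folder = Path(folder)
    pdb_stems = {p.stem for p in folder.glob("*.pdb")}
    return sorted(p.name for p in folder.glob("*.cif") if p.stem not in pdb_stems)


if __name__ == "__main__":
    # "./" mirrors the input folder used in the example commands.
    print(cif_files_without_pdb("./"))
```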
Run the following script to nominate single mutations using a protein inverse folding model:
bash ../scripts/01.single_mut_prediction.sh <scripts_dir> <input_folder> <beta> <gamma> [output_folder]
- <scripts_dir>: Directory containing the necessary sub-scripts (by default, the scripts folder).
- <input_folder>: Folder containing the input structure files (PDB/mmCIF). The script automatically searches for these files and writes the nominated single mutations to [output_folder], using the same file prefix as the structure file.
- <beta> and <gamma>: Screening thresholds for global occurrence and flexible region occurrence, respectively. We recommend 0.8 and 0.5 as general thresholds ("AiCE filtering").
- [output_folder]: (Optional) Folder for storing output results; the default is ../output.
Example:
bash ../scripts/01.single_mut_prediction.sh ../scripts ./ 0.8 0.5
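To make the role of the two thresholds more concrete, here is a minimal sketch of occurrence-based filtering over sampled sequences. It is not the repository's implementation: it assumes that a substitution's "occurrence" is the fraction of sampled sequences carrying it, that `<gamma>` applies at flexible positions and `<beta>` elsewhere, and that the flexible positions are already known (for example from the DSSP annotation). The function name `nominate_substitutions` and its arguments are hypothetical; see the paper for the actual rule.

```python
from collections import Counter


def nominate_substitutions(wild_type, samples, flexible_positions, beta, gamma):
    """Return (position, wt_residue, new_residue, frequency) tuples.

    Positions are 1-based. A substitution is kept when the fraction of
    sampled sequences carrying it reaches `gamma` at positions listed in
    `flexible_positions`, or `beta` everywhere else (an assumed rule).
    """
    nominated = []
    for index, wt_residue in enumerate(wild_type):
        position = index + 1
        counts = Counter(seq[index] for seq in samples)
        threshold = gamma if position in flexible_positions else beta
        for residue, count in sorted(counts.items()):
            frequency = count / len(samples)
            if residue != wt_residue and frequency >= threshold:
                nominated.append((position, wt_residue, residue, frequency))
    return nominated


if __name__ == "__main__":
    wild_type = "MKTA"
    samples = ["MRTA", "MRTG", "MRTG", "MRSA", "MKTG"]
    print(nominate_substitutions(wild_type, samples, {4}, beta=0.8, gamma=0.5))
```

In this toy input, K2R appears in 4 of 5 samples and passes the stricter threshold, while A4G appears in 3 of 5 and passes only because position 4 is treated as flexible.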
Note
Use bash (not sh) to execute the script to avoid unnecessary errors.
An alternative script automatically recommends <beta> and <gamma> values based on the input structure:
bash scripts/01.single_mut_Auto_prediction.sh <input_folder> [output_folder]
Additionally, scripts/inverse_MPNN.sh provides a ProteinMPNN-based inverse folding workflow. You can adjust parameters such as num_seq_per_target and sampling_temp to set the number of output sequences and the sampling temperature. We also provide a script, scripts/msa_to_pssm.py, for converting multiple sequence alignments (MSA) into position-specific scoring matrices (PSSM). Run it after completing the first step (01.single_mut_prediction.sh), using the generated .fa file as input.
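For readers unfamiliar with PSSMs, the sketch below shows one common textbook construction: per-column residue frequencies smoothed with a pseudocount and expressed as log2-odds against a uniform background. This is an illustration only; scripts/msa_to_pssm.py may use a different background model, weighting, or output format, and the names `msa_to_pssm` and `pseudocount` here are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def msa_to_pssm(sequences, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Scores compare smoothed column frequencies with a uniform background of
    1/20; gaps and non-standard symbols are ignored.
    """
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*sequences):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm
```

Positive scores mark residues that are enriched in a column relative to the background, and negative scores mark depleted ones.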
Note
We believe that current evaluations of combinations, whether in silico or in vivo, are limited in data volume. Moreover, the cases provided in this work also show that combination effects can be unstable. Therefore, we maintain a cautious yet positive attitude toward AiCE-multi. We are also actively developing an upgraded version to further improve the prediction accuracy and stability of combination designs.
Construct the LD matrix based on the inverse folding output sequences:
python ../scripts/02.caculated_ld.py <seq_dir> <output_ld_dir>
- <seq_dir>: Directory containing the inverse folding sequences with a .fa extension.
- <output_ld_dir>: Directory where the LD matrix files will be saved.
The script automatically searches for .fa files in <seq_dir>, predicts the LD matrix, and outputs files with the same prefix as the input. Output files include:
- .ld: Linkage disequilibrium matrix (derived from pseudo-reverse translated sequences).
- .vcf: File recording mutation information.
Example:
python ../scripts/02.caculated_ld.py ../output/ ../output
For more details, please refer to the accompanying manuscript.
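As background on what an LD score measures, the following sketch computes the textbook r² statistic between two sites, each encoded as a 0/1 vector marking whether a variant is present in each sampled sequence. It is a simplified illustration and not how plink or scripts/02.caculated_ld.py operate internally; the function name `r_squared` and the 0/1 encoding are assumptions.

```python
def r_squared(site_a, site_b):
    """Return r^2 between two equal-length 0/1 vectors.

    Each vector marks, per sampled sequence, whether a given variant is
    present (1) or absent (0). Returns 0.0 if either site is monomorphic.
    """
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return 0.0
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / denominator
```

Two variants that always co-occur give r² = 1, and variants that occur independently give r² = 0.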
Generate the Statistical Coupling Analysis (SCA) matrix:
bash ../scripts/03.caculated_sca.sh <script_dir> <input_dir> <output_dir>
- <script_dir>: Directory containing the sub-scripts for generating the evolutionary coupling matrix.
- <input_dir>: Directory containing the inverse folding sequences with a .fa extension.
- <output_dir>: Directory where the output files will be stored.
The script automatically searches for .fa files in <input_dir>, calls the necessary sub-scripts in <script_dir>, and outputs the results with the same file prefix as the input. Output files include:
- .sca_matrix.tsv: Amino acid evolutionary coupling matrix.
- .db: Binary file.
Example:
bash ../scripts/03.caculated_sca.sh ../scripts/pySCA/ ../output ../output
The folder ../scripts/pySCA/ contains the modified pySCA scripts.
Nominate multi-mutations using the LD and SCA matrices:
bash ../scripts/04.com_mut_prediction.sh <script_dir> <input_dir> <number-or-list> <output_dir>
• The script automatically searches <input_dir> for the .fa, .sca_matrix.tsv, .ld, .comb, and .vcf files produced in steps 1, 2, and 3.
• It outputs the SCA and LD scores for multi-mutations to files ending in .sca.result and .ld.result, respectively. The output file prefix will match the corresponding input file.
- <number-or-list>: If a number is provided, the script iterates through all mutation combinations of that order. If a list is provided (e.g., "1 3 5"), only the scores for the specified mutation combinations are output.
  • Each line in the list represents a mutation combination.
  • Positions are space-separated and use 1-based indexing.
  • Lines that are empty or contain unparseable non-integer characters are skipped or marked as invalid (output as None NaN).
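The list format described above can be illustrated with a short parser. This is a sketch of the stated rules (space-separated, 1-based positions; empty or non-integer lines are invalid), not the code in scripts/04.com_mut_prediction.sh. The function names are hypothetical, and rejecting positions below 1 is an added assumption that follows from 1-based indexing.

```python
from itertools import combinations


def parse_combination_line(line):
    """Parse space-separated 1-based positions; return None if invalid."""
    fields = line.split()
    if not fields:
        return None
    try:
        positions = [int(field) for field in fields]
    except ValueError:
        return None
    if any(position < 1 for position in positions):
        return None
    return positions


def all_combinations(positions, order):
    """Enumerate every combination of `order` positions."""
    return [list(combo) for combo in combinations(sorted(positions), order)]
```

`all_combinations` mirrors the numeric mode: given the nominated positions and an order such as 2, it enumerates every pair, so the number of combinations grows quickly with the order.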
The output file format is as follows:
- Column 1: Mutation Type
- Column 2: Mean Pairwise score: score
- Column 3: Log Mean Pairwise score: score
- Column 4: Logical Flag (0/1)
The fourth column indicates whether the mutation combination is recommended based on the screening thresholds (1 for recommended, 0 for not recommended). By default, the screening thresholds are set to 0.5 for LD scores and 0.9 for SCA score percentiles. You can customize these thresholds by providing an additional -t parameter (see lines 106–124 in scripts/04.com_mut_prediction.sh) directly.
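To illustrate what a mean pairwise score over a mutation combination looks like, the sketch below averages the entries of a symmetric pairwise matrix (such as an LD or SCA matrix) over all pairs of positions in the combination. It is a hedged illustration: the function names are hypothetical, reading the "Log Mean Pairwise score" as the natural log of the mean is an assumption, and the thresholding that produces the logical flag is not reproduced here.

```python
import math
from itertools import combinations


def mean_pairwise_score(matrix, positions):
    """Average matrix[i][j] over all pairs of 1-based `positions`."""
    pairs = list(combinations(positions, 2))
    if not pairs:
        raise ValueError("need at least two positions")
    return sum(matrix[i - 1][j - 1] for i, j in pairs) / len(pairs)


def log_mean_pairwise_score(matrix, positions):
    """Natural log of the mean pairwise score; -inf when the mean is 0."""
    mean = mean_pairwise_score(matrix, positions)
    return math.log(mean) if mean > 0 else float("-inf")
```

For a double mutant the mean reduces to the single matrix entry for that pair; for higher orders it averages over every pair in the combination.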
Example:
bash ../scripts/04.com_mut_prediction.sh ../scripts ../output/ 2 ../output
This command will iterate over all double mutation combinations and nominate HF mutations.
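The four steps can also be driven from Python. The sketch below only assembles the command lines shown in the examples above, so they can be inspected before anything is executed; the function name `build_pipeline_commands` and its default values are assumptions, and the script names and argument order are copied from this README.

```python
import shlex


def build_pipeline_commands(scripts_dir, input_dir, output_dir,
                            beta=0.8, gamma=0.5, order=2):
    """Return the four step commands as argument lists, in run order."""
    return [
        ["bash", f"{scripts_dir}/01.single_mut_prediction.sh",
         scripts_dir, input_dir, str(beta), str(gamma), output_dir],
        ["python", f"{scripts_dir}/02.caculated_ld.py", output_dir, output_dir],
        ["bash", f"{scripts_dir}/03.caculated_sca.sh",
         f"{scripts_dir}/pySCA/", output_dir, output_dir],
        ["bash", f"{scripts_dir}/04.com_mut_prediction.sh",
         scripts_dir, output_dir, str(order), output_dir],
    ]


if __name__ == "__main__":
    for command in build_pipeline_commands("../scripts", "./", "../output"):
        print(shlex.join(command))
```

To actually run the steps, each list can be passed to `subprocess.run(command, check=True)` from within the example/ directory, which stops at the first failing step.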
If you use this code, please cite:
@article{Li2025AiCE,
title={Advancing protein evolution with inverse folding models integrating structural and evolutionary constraints},
author={Fei, Hongyuan and Li, Yunjia and Liu, Yijing and Wei, Jingjing and Chen, Aojie and Gao, Caixia},
journal={Cell},
year={2025},
publisher={Cell Press}
}
This repository incorporates code from:
This project is licensed under the MIT License. Please note that portions of this project are covered by one or more patents.
Using this software for research or educational purposes is permitted under the MIT License.
However, commercial use of the patented invention requires explicit permission from the patent holder.
For licensing inquiries, please contact [cxgao@genetics.ac.cn].
By using this software, you acknowledge and agree to comply with all applicable patent laws and regulations.