SAbDab-backed antigen search and antibody-antigen epitope scoring pipeline.
Given a query protein FASTA, the pipeline searches SAbDab antigen chains with BLAST, downloads matching antibody-antigen complex structures, and reports structure-based epitope identity plus antibody Chothia CDR annotations.
Install the required command-line tools and Python packages:
mamba install -c conda-forge -c bioconda biotite blast pandas biopython anarci
Check the environment:
blastp -version
makeblastdb -version
python - <<'PY'
import pandas
import Bio
import biotite
import anarci
print("environment OK")
PY
Create the reference directory:
mkdir -p reference
Download the SAbDab summary table from:
https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/search/?all=true#downloads
Save it as:
reference/sabdab_summary_all.tsv
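Once the table is saved, it can be worth checking which antigen chains it lists before building the BLAST database. The sketch below is illustrative only and is not the code in filter_seqs.py. It assumes the summary has tab-separated columns named pdb, antigen_chain, and antigen_type, that multiple antigens in one row are separated by " | ", and that only antigens typed as "protein" are of interest. The helper name antigen_chain_pairs and the sample rows are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real table has many more columns.
SAMPLE = (
    "pdb\tHchain\tLchain\tantigen_chain\tantigen_type\n"
    "1abc\tH\tL\tA\tprotein\n"
    "2xyz\tH\tL\tB | C\tprotein | peptide\n"
    "3def\tH\tL\tNA\tNA\n"
)


def antigen_chain_pairs(summary_tsv):
    """Return a set of (pdb_id, chain_id) pairs for protein antigen chains."""
    # dtype=str keeps PDB IDs such as "1e10" from being parsed as numbers.
    table = pd.read_csv(summary_tsv, sep="\t", dtype=str)
    pairs = set()
    for row in table.itertuples(index=False):
        # Assumption: rows without an antigen hold "NA", which pandas reads as NaN.
        if not isinstance(row.antigen_chain, str) or not isinstance(row.antigen_type, str):
            continue
        chains = [c.strip() for c in row.antigen_chain.split("|")]
        kinds = [k.strip() for k in row.antigen_type.split("|")]
        for chain, kind in zip(chains, kinds):
            # Assumption: only protein antigens are kept.
            if kind == "protein":
                pairs.add((row.pdb.lower(), chain))
    return pairs


if __name__ == "__main__":
    print(sorted(antigen_chain_pairs(io.StringIO(SAMPLE))))
```

For the real file, pass the path reference/sabdab_summary_all.tsv instead of the in-memory sample.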
Download all PDB SEQRES sequences:
wget -O reference/pdb_seqres.fasta.gz \
https://files.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
Build the SAbDab antigen-chain FASTA:
python filter_seqs.py \
--fasta reference/pdb_seqres.fasta.gz \
--sabdab reference/sabdab_summary_all.tsv \
--output reference/sabdab_seq.fasta
Build the BLAST database:
makeblastdb \
-in reference/sabdab_seq.fasta \
-dbtype prot \
-out reference/blastdb
Use the all-in-one runner:
python run_pipeline.py \
--query test_input.fasta \
--outdir results/test_input \
--threads 4
For a different query:
python run_pipeline.py \
--query path/to/query.fasta \
--outdir results/query_name \
--threads 4
Multi-record FASTA files are processed sequentially:
python run_pipeline.py \
--query test_multi_input.fasta \
--outdir results/test_multi_input \
--threads 4
For multi-record input, each FASTA record ID gets its own output subdirectory:
results/test_multi_input/
|-- M1R/
|-- M1R_copy/
|-- merged_antibodies.csv
`-- query_manifest.tsv
Record IDs are preserved as query IDs. Characters unsuitable for directory names
are replaced with _; the runner stops if two record IDs would map to the same
directory name.
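The record-ID rule above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the set of characters treated as safe (letters, digits, ".", "_", "-") is an assumption, and the function names safe_dir_name and map_record_dirs are hypothetical.

```python
import re


def safe_dir_name(record_id):
    """Replace characters unsuitable for directory names with '_'."""
    # Assumption: only letters, digits, '.', '_' and '-' are treated as safe.
    return re.sub(r"[^A-Za-z0-9._-]", "_", record_id)


def map_record_dirs(record_ids):
    """Map each record ID to a directory name, stopping on any collision."""
    owner_by_dir = {}
    mapping = {}
    for record_id in record_ids:
        name = safe_dir_name(record_id)
        if name in owner_by_dir:
            raise ValueError(
                f"record IDs {owner_by_dir[name]!r} and {record_id!r} "
                f"both map to directory {name!r}"
            )
        owner_by_dir[name] = record_id
        # The original record ID stays the key, so it is preserved as the query ID.
        mapping[record_id] = name
    return mapping


if __name__ == "__main__":
    print(map_record_dirs(["M1R", "M1R_copy", "sp|P12345|X"]))
```

Checking for collisions before any output is written avoids one record silently overwriting another record's results.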
The runner reuses existing reference files, BLAST DB files, and cached mmCIF structures when they are already present.
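A reuse-if-present cache for mmCIF files might look like the sketch below. It is an assumption about how such caching can be done, not the runner's implementation: the RCSB download URL pattern, the lowercase file naming, and the function name fetch_mmcif are all illustrative choices.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: structures are fetched from the public RCSB download endpoint.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def fetch_mmcif(pdb_id, cache_dir, download=urlretrieve):
    """Return (path, status), where status is 'cached' or 'downloaded'."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{pdb_id.lower()}.cif"
    # Reuse a non-empty file that is already in the cache.
    if target.exists() and target.stat().st_size > 0:
        return target, "cached"
    # Download to a temporary name first so an interrupted transfer
    # never leaves a truncated file under the final name.
    partial = target.parent / (target.name + ".part")
    download(RCSB_URL.format(pdb_id=pdb_id.upper()), str(partial))
    partial.replace(target)
    return target, "downloaded"
```

The download argument is injectable so the caching logic can be exercised without network access; a call such as fetch_mmcif("1abc", "reference/pdb_files") would use urlretrieve.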
For the example command, outputs are written under results/test_input/:
- blastp_hits.tsv: BLAST tabular hits.
- download_manifest.tsv: downloaded or cached PDB mmCIF files.
- structure_scores.csv: per-hit structure scoring table.
- merged_antibodies.csv: exact heavy/light antibody sequence groups, with support_pdb_ids separated by | and epitope_identity_range summarizing the per-structure identity range for that antibody group.
- structure_score_errors.tsv: per-hit scoring failures, if any.
- logs/: command logs.
For multi-record input:
- <record_id>/: the complete normal output set for one FASTA record.
- query_manifest.tsv: maps input record IDs to generated subdirectories and files.
- Top-level merged_antibodies.csv: exact antibody sequence groups recomputed from all records. Its query_id_list, support_pdb_ids, support count, and epitope_identity_range summarize the complete multi-query run.
Antibody chain descriptions:
- structure_scores.csv includes heavy_chain_description and light_chain_description, read from the corresponding PDB SEQRES FASTA header.
- merged_antibodies.csv includes heavy_chain_description_list and light_chain_description_list, merged with the same unique |-separated logic used for antigen_name_list.
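The grouping behind merged_antibodies.csv can be pictured with the pandas sketch below. It is a hedged illustration rather than the pipeline's code. The input column names heavy_sequence, light_sequence, pdb_id, antigen_name, and epitope_identity are assumptions, as are the support_count output name and the "min-max" text format used for epitope_identity_range.

```python
import pandas as pd


def join_unique(values):
    """Join unique non-empty values with '|', keeping first-seen order."""
    seen = []
    for value in values:
        if pd.isna(value):
            continue
        text = str(value)
        if text and text not in seen:
            seen.append(text)
    return "|".join(seen)


def identity_range(values):
    """Summarize identities as 'min-max', or a single value when they agree."""
    low, high = values.min(), values.max()
    # Assumption: three decimals and a '-' separator; the real format may differ.
    return f"{low:.3f}" if low == high else f"{low:.3f}-{high:.3f}"


def merge_antibodies(scores):
    """Group per-structure rows by exact heavy/light sequence pair."""
    rows = []
    # Note: groupby drops rows whose key is NaN, so a missing light chain
    # would need to be filled (for example with "") before grouping.
    grouped = scores.groupby(["heavy_sequence", "light_sequence"], sort=False)
    for (heavy, light), group in grouped:
        rows.append({
            "heavy_sequence": heavy,
            "light_sequence": light,
            "support_pdb_ids": join_unique(group["pdb_id"]),
            "support_count": int(group["pdb_id"].nunique()),
            "antigen_name_list": join_unique(group["antigen_name"]),
            "heavy_chain_description_list": join_unique(group["heavy_chain_description"]),
            "light_chain_description_list": join_unique(group["light_chain_description"]),
            "epitope_identity_range": identity_range(group["epitope_identity"]),
        })
    return pd.DataFrame(rows)
```

Grouping on the exact sequence pair means two antibodies that differ by a single residue stay in separate rows.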
structure_scores.csv reports epitope identity using antibody CDR-contacting
antigen residues:
- CDRs are defined with Chothia numbering by default (--scheme chothia).
- Only antigen residues contacting antibody CDR atoms are counted as epitope residues.
- Target antigen-chain epitope residues are aligned to the query sequence and scored as identical or mutated.
- If antibody CDRs also contact other non-antibody protein chains, those contact residues are included in the epitope denominator as missing multichain binding residues.
- Other chains are excluded from this multichain penalty when their sequence identity to the target antigen chain is at least --homodimer-identity-threshold (default 0.95), treating them as target-chain homodimer copies.
- Structures are read from the first biological assembly by default. If no matching CDR-mediated antibody-antigen complex is found there, scoring falls back to the asymmetric unit and reports this in structure_warnings.
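The scoring rules above can be made concrete with a small sketch. It is not the code in score_structures.py, and several details are assumptions: contacts are taken as any antigen atom within the cutoff (inclusive) of any CDR atom, an epitope residue with no aligned query residue counts as mutated, and the inputs are simplified to plain arrays and tuples. The function names are hypothetical.

```python
import numpy as np


def contacting_residues(antigen_xyz, antigen_res_ids, cdr_xyz, cutoff=4.5):
    """Return sorted antigen residue IDs with any atom within cutoff of a CDR atom."""
    antigen_xyz = np.asarray(antigen_xyz, dtype=float)
    cdr_xyz = np.asarray(cdr_xyz, dtype=float)
    # Pairwise distances, shape (n_antigen_atoms, n_cdr_atoms).
    diff = antigen_xyz[:, None, :] - cdr_xyz[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    close = (dist <= cutoff).any(axis=1)
    return sorted(set(np.asarray(antigen_res_ids)[close].tolist()))


def epitope_identity(target_pairs, other_chains, homodimer_threshold=0.95):
    """Fraction of epitope residues that are identical in the query.

    target_pairs: (target_residue, aligned_query_residue or None) per epitope residue.
    other_chains: (identity_to_target_chain, n_cdr_contact_residues) per other chain.
    """
    identical = sum(1 for target, query in target_pairs if query == target)
    # Chains at or above the threshold are treated as homodimer copies
    # and do not add to the denominator.
    missing = sum(
        n_contacts
        for identity, n_contacts in other_chains
        if identity < homodimer_threshold
    )
    denominator = len(target_pairs) + missing
    if denominator == 0:
        return None  # no epitope residues found
    return identical / denominator


if __name__ == "__main__":
    pairs = [("K", "K"), ("D", "D"), ("Y", "F"), ("N", None)]
    print(epitope_identity(pairs, [(0.98, 3), (0.30, 2)]))
```

In the example, two of four target residues are identical, the 0.98-identity chain is skipped as a homodimer copy, and the 0.30-identity chain adds two missing residues, giving 2 / 6.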
Useful scoring parameters:
python run_pipeline.py \
--query test_input.fasta \
--outdir results/test_input \
--ca-distance-threshold 8.0 \
--atom-distance-threshold 4.5 \
--homodimer-identity-threshold 0.95 \
--scheme chothia
After BLAST and structure download, scoring can be run directly:
python score_structures.py \
--query-fasta test_input.fasta \
--hits results/test_input/blastp_hits.tsv \
--sabdab reference/sabdab_summary_all.tsv \
--pdb-seqres reference/pdb_seqres.fasta.gz \
--pdb-cache reference/pdb_files \
--output results/test_input/structure_scores.csv \
--merged-antibody-output results/test_input/merged_antibodies.csv \
--errors-output results/test_input/structure_score_errors.tsv \
--scheme chothia
Use find_chain_name.py to annotate plain one-sequence-per-line protein input against any BLAST protein database:
python find_chain_name.py \
--input sequences.txt \
--blastdb reference/blastdb \
--output chain_names.csv \
--min-identity 0.7 \
--threads 4
If blastp is not on PATH, pass it explicitly:
python find_chain_name.py \
--input sequences.txt \
--blastdb reference/blastdb \
--output chain_names.csv \
--blastp /home/jianfc/miniforge3/envs/bioinfo/bin/blastp
The output CSV contains one row per input line, including the best matching subject ID, FASTA description, identity, query coverage, e-value, and bitscore.
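To show how a best match per input sequence could be chosen, the sketch below parses BLAST tabular output and keeps the top hit for each query. It is an illustration under stated assumptions, not the logic of find_chain_name.py: it assumes blastp was run with -outfmt "6 qseqid sseqid pident qcovs evalue bitscore stitle", that "best" means highest bitscore, and that --min-identity is a fraction compared against pident / 100. The function name best_hits is hypothetical.

```python
import csv
import io

# Assumed blastp output format; the fields after "6" define the column order.
OUTFMT = "6 qseqid sseqid pident qcovs evalue bitscore stitle"
FIELDS = OUTFMT.split()[1:]


def best_hits(blast_tsv_text, min_identity=0.7):
    """Return {query_id: hit} keeping the highest-bitscore hit per query."""
    best = {}
    reader = csv.DictReader(
        io.StringIO(blast_tsv_text),
        fieldnames=FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # subject titles may contain quote characters
    )
    for row in reader:
        identity = float(row["pident"]) / 100.0  # pident is a percentage
        if identity < min_identity:
            continue
        hit = {
            "subject_id": row["sseqid"],
            "description": row["stitle"],
            "identity": identity,
            "query_coverage": float(row["qcovs"]),
            "evalue": float(row["evalue"]),
            "bitscore": float(row["bitscore"]),
        }
        current = best.get(row["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = hit
    return best


if __name__ == "__main__":
    example = (
        "q1\t1abc_A\t95.0\t100\t1e-50\t200\tmol:protein length:120 SPIKE\n"
        "q1\t2xyz_B\t99.0\t80\t1e-40\t150\tOTHER\n"
        "q2\t3def_C\t50.0\t90\t1e-5\t40\tLOW\n"
    )
    print(best_hits(example))
```

Queries with no hit above the threshold are absent from the returned dictionary, so a caller that writes one row per input line would need to emit an empty row for them.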