ab-search

SAbDab-backed antigen search and antibody-antigen epitope scoring pipeline.

Given a query protein FASTA, the pipeline searches SAbDab antigen chains with BLAST, downloads matching antibody-antigen complex structures, and reports structure-based epitope identity plus antibody Chothia CDR annotations.

1. Environment

Install the required command-line tools and Python packages:

mamba install -c conda-forge -c bioconda biotite blast pandas biopython anarci

Check the environment:

blastp -version
makeblastdb -version
python - <<'PY'
import pandas
import Bio
import biotite
import anarci
print("environment OK")
PY

2. Prepare Reference Data

Create the reference directory:

mkdir -p reference

Download the SAbDab summary table from:

https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/search/?all=true#downloads

Save it as:

reference/sabdab_summary_all.tsv

Download all PDB SEQRES sequences:

wget -O reference/pdb_seqres.fasta.gz \
  https://files.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt.gz

Build the SAbDab antigen-chain FASTA:

python filter_seqs.py \
  --fasta reference/pdb_seqres.fasta.gz \
  --sabdab reference/sabdab_summary_all.tsv \
  --output reference/sabdab_seq.fasta
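
The filtering step above can be pictured with the following minimal sketch. It is not the code in filter_seqs.py. It assumes the SAbDab summary has pdb and antigen_chain columns, that multi-chain antigens are written as "A | B", that missing antigens are marked NA, and that SEQRES headers start with an ID of the form 1abc_A. The function names are hypothetical.

```python
import csv
import io


def antigen_chain_ids(sabdab_tsv_text):
    """Collect 'pdb_chain' keys for every antigen chain listed in the summary."""
    wanted = set()
    reader = csv.DictReader(io.StringIO(sabdab_tsv_text), delimiter="\t")
    for row in reader:
        chains = (row.get("antigen_chain") or "").strip()
        # Assumption: rows without an antigen carry an empty value or "NA".
        if not chains or chains.upper() == "NA":
            continue
        # Assumption: multi-chain antigens are written as "A | B".
        for chain in chains.split("|"):
            chain = chain.strip()
            if chain:
                wanted.add(f"{row['pdb'].strip().lower()}_{chain}")
    return wanted


def filter_fasta(lines, wanted):
    """Yield only the FASTA lines of records whose header ID is in wanted."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            record_id = tokens[0] if tokens else ""
            # Lowercase the PDB ID only; chain IDs are case-sensitive.
            pdb_id, _, chain = record_id.partition("_")
            keep = f"{pdb_id.lower()}_{chain}" in wanted
        if keep:
            yield line
```

For the gzipped SEQRES file, the lines could be supplied by gzip.open(path, "rt") from the standard library.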

Build the BLAST database:

makeblastdb \
  -in reference/sabdab_seq.fasta \
  -dbtype prot \
  -out reference/blastdb

3. Run The Pipeline

Use the all-in-one runner:

python run_pipeline.py \
  --query test_input.fasta \
  --outdir results/test_input \
  --threads 4

For a different query:

python run_pipeline.py \
  --query path/to/query.fasta \
  --outdir results/query_name \
  --threads 4

Multi-record FASTA files are processed sequentially:

python run_pipeline.py \
  --query test_multi_input.fasta \
  --outdir results/test_multi_input \
  --threads 4

For multi-record input, each FASTA record ID gets its own output subdirectory:

results/test_multi_input/
|-- M1R/
|-- M1R_copy/
|-- merged_antibodies.csv
`-- query_manifest.tsv

Record IDs are preserved as query IDs. Characters unsuitable for directory names are replaced with _; the runner stops if two record IDs would map to the same directory name.
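
As an illustration of the naming rule above, here is a minimal sketch of ID sanitization with collision detection. The exact set of characters the runner treats as unsuitable is not documented here, so the allowed set below (letters, digits, ".", "-", "_") is an assumption, and both function names are hypothetical.

```python
import re


def safe_dir_name(record_id):
    # Assumption: everything except letters, digits, '.', '-' and '_' becomes '_'.
    return re.sub(r"[^A-Za-z0-9._-]", "_", record_id)


def map_record_dirs(record_ids):
    """Map each record ID to a directory name, failing on any collision."""
    owner = {}    # directory name -> record ID that claimed it
    mapping = {}  # record ID -> directory name
    for record_id in record_ids:
        name = safe_dir_name(record_id)
        if name in owner:
            raise ValueError(
                f"{record_id!r} and {owner[name]!r} both map to directory {name!r}"
            )
        owner[name] = record_id
        mapping[record_id] = name
    return mapping
```

With this rule, the hypothetical IDs a/b and a|b would both map to a_b, so the run would stop instead of silently mixing their outputs.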

The runner reuses existing reference files, BLAST DB files, and cached mmCIF structures when they are already present.

4. Main Outputs

For the example command, outputs are written under results/test_input/:

blastp_hits.tsv: BLAST tabular hits.
download_manifest.tsv: downloaded or cached PDB mmCIF files.
structure_scores.csv: per-hit structure scoring table.
merged_antibodies.csv: antibodies grouped by exact heavy/light sequence. support_pdb_ids lists the supporting structures separated by |, and epitope_identity_range summarizes the per-structure identity range for each antibody group.
structure_score_errors.tsv: per-hit scoring failures, if any.
logs/: command logs.
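
The grouping behind merged_antibodies.csv can be sketched as follows. This is an illustration, not the pipeline's implementation: the input keys heavy_seq, light_seq, pdb_id, and epitope_identity are hypothetical, as is the "min-max" text format used for the range.

```python
from collections import defaultdict


def merge_antibodies(rows):
    """Group per-structure rows by exact heavy/light sequence pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["heavy_seq"], row["light_seq"])].append(row)

    merged = []
    for (heavy, light), members in groups.items():
        # Unique supporting structures, joined with '|'.
        pdb_ids = sorted({m["pdb_id"] for m in members})
        identities = [m["epitope_identity"] for m in members]
        merged.append({
            "heavy_seq": heavy,
            "light_seq": light,
            "support_pdb_ids": "|".join(pdb_ids),
            # Assumption: the range is rendered as "min-max" with two decimals.
            "epitope_identity_range": f"{min(identities):.2f}-{max(identities):.2f}",
        })
    return merged
```

Grouping on the exact sequence pair means that antibodies differing by a single residue stay in separate groups.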

For multi-record input:

<record_id>/: the complete normal output set for one FASTA record.
query_manifest.tsv: maps input record IDs to generated subdirectories and files.
Top-level merged_antibodies.csv: exact antibody sequence groups recomputed from all records. Its query_id_list, support_pdb_ids, support count, and epitope_identity_range summarize the complete multi-query run.

Antibody chain descriptions:

structure_scores.csv includes heavy_chain_description and light_chain_description, read from the corresponding PDB SEQRES FASTA header.
merged_antibodies.csv includes heavy_chain_description_list and light_chain_description_list, merged with the same unique |-separated logic used for antigen_name_list.

5. Scoring Notes

structure_scores.csv reports epitope identity using antibody CDR-contacting antigen residues:

CDRs are defined with Chothia numbering by default (--scheme chothia).
Only antigen residues contacting antibody CDR atoms are counted as epitope residues.
Target antigen-chain epitope residues are aligned to the query sequence and scored as identical or mutated.
If antibody CDRs also contact other non-antibody protein chains, those contact residues are added to the epitope denominator and counted as missing (the multichain penalty).
Other chains whose sequence identity to the target antigen chain is at least --homodimer-identity-threshold (default 0.95) are treated as homodimer copies of the target chain and excluded from this multichain penalty.
Structures are read from the first biological assembly by default. If no matching CDR-mediated antibody-antigen complex is found there, scoring falls back to the asymmetric unit and reports this in structure_warnings.
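
The scoring rules above can be read as a simple ratio, sketched below. This is one interpretation of the notes, not the code in score_structures.py: the function name, the input shapes, and the choice to return None for an empty epitope are all assumptions.

```python
def epitope_identity(epitope_pairs, other_chain_contacts,
                     homodimer_identity_threshold=0.95):
    """Fraction of epitope residues that are identical in the query.

    epitope_pairs: (target_residue, aligned_query_residue) tuples for the
        CDR-contacting residues of the target antigen chain; '-' marks a gap.
    other_chain_contacts: (identity_to_target_chain, n_contact_residues)
        tuples, one per additional non-antibody chain touched by the CDRs.
    """
    identical = sum(1 for target, query in epitope_pairs if target == query)
    # Chains at or above the threshold count as homodimer copies: no penalty.
    missing = sum(
        n for identity, n in other_chain_contacts
        if identity < homodimer_identity_threshold
    )
    denominator = len(epitope_pairs) + missing
    if denominator == 0:
        return None  # Assumption: no epitope residues means no score.
    return identical / denominator
```

For example, two identical residues out of four on the target chain, plus one contact residue on an unrelated second chain, would give 2 / 5 = 0.4 under this reading.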

Useful scoring parameters:

python run_pipeline.py \
  --query test_input.fasta \
  --outdir results/test_input \
  --ca-distance-threshold 8.0 \
  --atom-distance-threshold 4.5 \
  --homodimer-identity-threshold 0.95 \
  --scheme chothia

6. Manual Structure Scoring

After BLAST and structure download, scoring can be run directly:

python score_structures.py \
  --query-fasta test_input.fasta \
  --hits results/test_input/blastp_hits.tsv \
  --sabdab reference/sabdab_summary_all.tsv \
  --pdb-seqres reference/pdb_seqres.fasta.gz \
  --pdb-cache reference/pdb_files \
  --output results/test_input/structure_scores.csv \
  --merged-antibody-output results/test_input/merged_antibodies.csv \
  --errors-output results/test_input/structure_score_errors.tsv \
  --scheme chothia

7. Find Chain Names

Use find_chain_name.py to annotate a plain-text file of protein sequences (one sequence per line) against any BLAST protein database:

python find_chain_name.py \
  --input sequences.txt \
  --blastdb reference/blastdb \
  --output chain_names.csv \
  --min-identity 0.7 \
  --threads 4

If blastp is not on PATH, pass it explicitly:

python find_chain_name.py \
  --input sequences.txt \
  --blastdb reference/blastdb \
  --output chain_names.csv \
  --blastp /path/to/blastp

The output CSV contains one row per input line, including the best matching subject ID, FASTA description, identity, query coverage, e-value, and bitscore.
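
To make the reported fields concrete, here is a minimal sketch of best-hit selection for one input sequence. It assumes hits carry the standard BLAST tabular fields pident (a percentage), qstart, qend, and bitscore, that --min-identity is a fraction, and that the best hit is the one with the highest bitscore. The function name and the ranking rule are assumptions, not confirmed behavior of find_chain_name.py.

```python
def best_hit(hits, query_length, min_identity=0.7):
    """Return the highest-bitscore hit that passes the identity filter."""
    best = None
    for hit in hits:
        identity = hit["pident"] / 100.0  # BLAST reports percent identity
        if identity < min_identity:
            continue
        # Fraction of the query covered by the aligned region.
        coverage = (hit["qend"] - hit["qstart"] + 1) / query_length
        candidate = dict(hit, identity=identity, query_coverage=coverage)
        if best is None or candidate["bitscore"] > best["bitscore"]:
            best = candidate
    return best  # None when no hit passes the filter
```

Under these assumptions, an input line with no hit above the threshold would still produce a row, with the match fields left empty.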
