Note
Documentation / Documentação:
- For the Brazilian Portuguese version of this file, see README.PTBR.md.
- To access the detailed user manual, see MANUAL.md.
This tool is an updated version of ATTILA, originally published in:
Discovering Selected Antibodies From Deep-Sequenced Phage-Display Antibody Library Using ATTILA
Andréa Queiroz Maranhão, Heidi Muniz Silva, Waldeyr Mendes Cordeiro da Silva, Renato Kaylan Alves França, Thais Canassa De Leo, Marcelo Dias-Baruffi, Rafael Trindade Burtet, Marcelo Macedo Brigido
Bioinformatics and Biology Insights, 2020. DOI: 10.1177/1177932220915240
Phage display is a powerful technique to select high-affinity antibodies for different purposes, including biopharmaceuticals. Next-generation sequencing (NGS) presented itself as a robust solution, making it possible to assess billions of sequences of the variable domains from selected sublibraries. Handling this process, a central difficulty is to find the selected clones. Here, we present the AutomaTed Tool For Immunoglobulin Analysis (ATTILA), a new tool to analyze and find the enriched variable domains throughout a biopanning experiment. The ATTILA is a workflow that combines publicly available tools and in-house programs and scripts to find the fold-change frequency of deeply sequenced amplicons generated from selected VH and VL domains. We analyzed the same human Fab library NGS data using ATTILA in 5 different experiments, as well as on 2 biopanning experiments regarding performance, accuracy, and output. These analyses proved to be suitable to assess library variability and to list the more enriched variable domains, as ATTILA provides a report with the amino acid sequence of each identified domain, along with its complementarity-determining regions (CDRs), germline classification, and fold change. Finally, the methods employed here demonstrated a suitable manner to combine amplicon generation and NGS data analysis to discover new monoclonal antibodies (mAbs).
ATTILA (AutomaTed Tool For Immunoglobulin Analysis) is a bioinformatics pipeline designed to search and select candidate clones of immunoglobulins (VH and VL) from libraries generated by Phage Display experiments. Originally built with a combination of Perl, R statistical scripts, and proprietary C binaries, the system has been entirely rewritten and consolidated into native Python 3 to guarantee full cross-platform portability (Windows, Linux, macOS), robust step execution, and independence from unnecessary additional interpreters.
- Paired-End Reads Assembly (Join): Merging of forward and reverse sequences into a single contiguous fragment via `fastq-join`.
- Filtering and Quality Control (Filter): Automatic removal of short or low-quality reads via `prinseq-lite` and generation of visual statistics via `fastqc`.
- 6-Frame Local Translation and ORF Detection (Translate): Native Python translation of nucleotide sequences to amino acids in the 6 possible reading frames, filtering by valid immunoglobulin ORFs containing conserved Cysteines and the FR4 motif (`WG.G` for VH, `FG.G` for VL).
- Relative Frequency and Enrichment Calculation (Frequency): Abundance counting of selected CDR3s and automatic Fold Change calculation of enrichment between the initial round (R0) and final round (RN).
- Aligned Residue Numbering (Number): Automated access to the Kabat numbering scheme of antibodies via the UCL Abnum API.
- Nucleotide Sequence Recovery (NT-Recovery): Retrieval of original nucleotide sequences corresponding to the selected clones by mapping alignment coordinates.
- Statistical Proportion Tests (Stats): Native Python statistical calculation of the single-tailed Z-test of proportion differences and 95% confidence intervals for enrichment, applying Bonferroni correction.
- Germline Classification (Germline): Alignment and germline gene assignment using local `igblastp`.
- Consolidated Visual Report (Report): Compilation of complete CDR/FR region tables for candidate clones, read loss data, and charts into a standalone interactive HTML report.
- Pipeline Verification and Demo (--example): Built-in synthetic dataset generation and pipeline validation that runs all compatible steps based on the host environment dependencies.
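
The Translate step described above can be illustrated with a minimal sketch. This is not ATTILA's own `translate_all` code: the function names, the "at least two Cysteines" rule, and the rejection of frames containing stop codons are assumptions made for illustration. Only the FR4 motifs (`WG.G` for VH, `FG.G` for VL) come from the description above.

```python
import re

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# FR4 motifs quoted in the feature list; "." matches any residue.
FR4_MOTIF = {"vh": re.compile(r"WG.G"), "vl": re.compile(r"FG.G")}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one frame; unknown codons (e.g. containing N) become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Yield (frame_label, protein) for the 3 forward and 3 reverse frames."""
    rev = reverse_complement(dna)
    for offset in range(3):
        yield f"{offset + 1}+", translate(dna[offset:])
        yield f"{offset + 1}-", translate(rev[offset:])


def candidate_frames(dna, chain="vh"):
    """Keep frames that look like an immunoglobulin ORF.

    Assumed criteria: no stop codon, at least two Cys, and the FR4 motif.
    """
    motif = FR4_MOTIF[chain]
    return [
        (label, protein)
        for label, protein in six_frames(dna)
        if "*" not in protein and protein.count("C") >= 2 and motif.search(protein)
    ]
```

For example, `candidate_frames("TGTGCTAAATGTTGGGGTCAAGGT", "vh")` keeps frame `1+`, which translates to `CAKCWGQG`. The frame labels mirror the `FRAME:1+` style used in the translated read identifiers.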
The pipeline stores raw reads, intermediate results, and metrics directly on the filesystem in structured formats (FASTQ, FASTA, CSV, TXT). The data model below describes these files and how they relate to each other.
erDiagram
FASTQ_Reads ||--o{ FASTA_Proteins : "translated to"
FASTA_Proteins ||--o{ FASTA_Alignment : "Kabat aligned"
FASTA_Alignment ||--o{ TXT_Statistics : "statistically tested"
CSV_Counting }o--|| FASTQ_Reads : "quantifies reads in"

FASTQ_Reads:

| Field | Type | Description |
|---|---|---|
| id | TEXT | Unique read identifier generated by the sequencer (header). |
| seq | TEXT | DNA nucleotide sequence (A, C, T, G, N bases). |
| qual | TEXT | Phred quality ASCII string corresponding to each nucleotide. |
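
The three fields in the table above map directly onto a FASTQ record. The following sketch reads such records and computes a mean Phred score. It assumes plain four-line FASTQ records and Phred+33 quality encoding; `FastqRecord`, `read_fastq`, and `mean_phred` are hypothetical names and not part of ATTILA.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    id: str    # header line without the leading "@"
    seq: str   # nucleotide sequence (A, C, T, G, N)
    qual: str  # Phred quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file (or a trailing blank line)
            return
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # "+" separator line, ignored
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], seq, qual)


def mean_phred(record: FastqRecord, offset: int = 33) -> float:
    """Average Phred score, assuming the given ASCII offset (Phred+33)."""
    if not record.qual:
        return 0.0
    return sum(ord(c) - offset for c in record.qual) / len(record.qual)


if __name__ == "__main__":
    demo = io.StringIO("@read_1\nACGTN\n+\nIIII#\n")
    for rec in read_fastq(demo):
        print(rec.id, rec.seq, round(mean_phred(rec), 1))
```
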

FASTA_Proteins:

| Field | Type | Description |
|---|---|---|
| id | TEXT | Read identifier associated with the frame (e.g., `seq_id\|FRAME:1+`). |
| seq | TEXT | Full sequence of the translated variable domain of amino acids. |
| cdr3_seq | TEXT | Isolated sequence of the identified CDR3 loop. |
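
The `cdr3_seq` field is what the Frequency step counts. As a rough sketch of that calculation, the function below computes the relative frequency of each CDR3 in R0 and RN and the resulting fold change. The function name and the pseudocount used for CDR3s absent from R0 are assumptions; ATTILA's own handling of that case may differ.

```python
from collections import Counter


def fold_changes(r0_cdr3s, rn_cdr3s, pseudocount=1):
    """Return {cdr3: fold_change}, sorted from most to least enriched.

    fold_change = (count_RN / total_RN) / (count_R0 / total_R0)
    A CDR3 missing from R0 is given `pseudocount` reads (an assumption)
    so that the ratio stays finite.
    """
    r0, rn = Counter(r0_cdr3s), Counter(rn_cdr3s)
    total_r0, total_rn = sum(r0.values()), sum(rn.values())
    if total_r0 == 0 or total_rn == 0:
        raise ValueError("both rounds need at least one CDR3 sequence")

    result = {}
    for cdr3, count_rn in rn.items():
        count_r0 = r0.get(cdr3, 0) or pseudocount
        result[cdr3] = (count_rn / total_rn) / (count_r0 / total_r0)
    return dict(sorted(result.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Toy CDR3 lists, invented for illustration only.
    r0 = ["ARDYW"] * 1 + ["ARGGW"] * 3
    rn = ["ARDYW"] * 2 + ["ARGGW"] * 2
    for cdr3, fc in fold_changes(r0, rn).items():
        print(cdr3, round(fc, 2))
```
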

CSV_Counting:

| Field | Type | Description |
|---|---|---|
| library | TEXT | Name of the corresponding library (R0, RN, or Selected). |
| reads | INTEGER | Number of reads/sequences remaining after processing. |
| step | TEXT | Corresponding pipeline step (raw, joining, filtering, translation, frequency, enrichment, numeration). |
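
Records of this shape are enough to derive the read loss figures shown in the report. The sketch below assumes a CSV with a header row named `library,reads,step` (the real column order and header are not stated here) and reports the percentage of raw reads retained after each step. The sample numbers are invented.

```python
import csv
import io
from collections import defaultdict


def read_retention(csv_text):
    """Return {library: [(step, reads, percent_of_raw), ...]}.

    The 'raw' step is used as the baseline; if it is absent, the first
    step listed for that library is used instead (an assumption).
    """
    per_library = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_library[row["library"]].append((row["step"], int(row["reads"])))

    report = {}
    for library, steps in per_library.items():
        baseline = dict(steps).get("raw") or steps[0][1]
        report[library] = [
            (step, reads, round(100 * reads / baseline, 1) if baseline else 0.0)
            for step, reads in steps
        ]
    return report


if __name__ == "__main__":
    sample = (
        "library,reads,step\n"
        "R0,1000,raw\n"
        "R0,900,joining\n"
        "R0,450,filtering\n"
    )
    for step, reads, pct in read_retention(sample)["R0"]:
        print(f"{step:<10} {reads:>6} {pct:>6}%")
```
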

TXT_Statistics:

| Field | Type | Description |
|---|---|---|
| id | TEXT | Candidate clone identifier. |
| pvalue | REAL | P-value of the single-tailed Z-test of proportion enrichment. |
| infIC | REAL | Lower bound of the 95% confidence interval of the difference. |
| supIC | REAL | Upper bound of the 95% confidence interval of the difference. |
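
The three statistics in the table above can be reproduced with a short, standard-library sketch of a one-tailed two-proportion Z-test. This is an illustration, not ATTILA's `calculate_z_test`: the pooled standard error for the test, the unpooled standard error for the interval, and applying Bonferroni only to the p-value are assumptions about details the description leaves open.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def normal_sf(z):
    """Upper-tail probability P(Z > z) of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def enrichment_z_test(count_r0, total_r0, count_rn, total_rn, n_tests=1):
    """One-tailed test of H1: proportion in RN > proportion in R0.

    Returns pvalue (Bonferroni-adjusted by n_tests, capped at 1) and the
    95% confidence interval (infIC, supIC) of the difference pRN - pR0.
    """
    p0 = count_r0 / total_r0
    pn = count_rn / total_rn
    diff = pn - p0

    # Pooled standard error under the null hypothesis of equal proportions.
    pooled = (count_r0 + count_rn) / (total_r0 + total_rn)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / total_r0 + 1 / total_rn))
    z = diff / se_pooled if se_pooled > 0 else 0.0
    pvalue = min(1.0, normal_sf(z) * n_tests)

    # Unpooled standard error for the confidence interval of the difference.
    se_diff = math.sqrt(p0 * (1 - p0) / total_r0 + pn * (1 - pn) / total_rn)
    half_width = Z_95 * se_diff
    return {"pvalue": pvalue, "infIC": diff - half_width, "supIC": diff + half_width}


if __name__ == "__main__":
    # Invented counts: a clone seen 10 times in 1000 R0 reads, 200 in 1000 RN reads.
    print(enrichment_z_test(10, 1000, 200, 1000, n_tests=50))
```

A clone whose `infIC` is above zero and whose adjusted `pvalue` is small is enriched in RN relative to R0 under these assumptions.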
graph TD
run[run.sh: Bash Wrapper] -->|calls| attila[attila.py: Main Orchestrator]
attila -->|1. Join| fjoin[fastq-join: PE reads merger]
attila -->|2. Filter| fqc[fastqc: Quality control]
attila -->|2. Filter| prinseq[prinseq-lite: Quality filtering]
attila -->|3. Translate| trans[attila.py - translate_all: Frame & ORF translation]
attila -->|4. Frequency| freq[attila.py - frequency_counter: Counts & Fold Change]
attila -->|5. Number| num[attila.py - numberab: Kabat UCL numbering]
attila -->|6. NT-Recovery| rec[attila.py - get_ntsequence: NT recovery]
attila -->|7. Stats| stat[attila.py - calculate_z_test: Statistical Z-test]
attila -->|8. Germline| igb[igblastp: Germline alignment]
attila -->|9. Report| rep[html_creator.py: HTML Report Generator]
| Technology | Version | Description |
|---|---|---|
| Python | 3.11+ | Orchestration language and main biological processing logic. |
| Bash | 4.0+ | Friendly command-line wrapper script for setup and execution. |
| FastQC | 0.11+ | Visual quality control of biological sequences. |
| Prinseq-lite | 0.20+ | Quality filtering and trimming of reads. |
| Fastq-join | 1.01+ | Assembly/merging of paired-end forward and reverse reads. |
| IgBlast | 1.14.0+ | Alignment against human/mouse germline databases. |
| Matplotlib | 3.9+ | Native plotting of statistical charts in report (Optional - fallback to R). |
| NumPy | 2.0+ | Numerical operations required by Matplotlib. |
| R / Rscript | 4.0+ | Alternative interpreter for statistical charts (Optional - ggplot2, scales). |
| Bootstrap | 5.3.0 | CSS framework used for the modern design and responsiveness of the HTML Report. |
| Bootstrap Icons | 1.10.5 | Vector icon library used for collapsible panels in the report. |
| Google Fonts (Inter) | N/A | Modern font family used to enhance scientific report readability. |

| Feature | Form Field / Argument | Database Field / Configuration | Applied Rules |
|---|---|---|---|
| Orchestration | `--config my_project.cfg` | Input `.cfg` file containing paths. | The configuration file must exist and contain valid paths within the filesystem. |
| Modular Execution | `--steps join,filter,translate` | `ATTILA_STEPS` env | The pipeline will run only the specified comma-separated sub-steps. |
| Chain Definition | `--type vh`, `--type vl` or `--type both` | `libtype` | Allows running VH (0), VL (1) or both configurations sequentially and automatically. |
| Demo Run | `--example` | Environment detection | Runs VH and VL pipelines sequentially on synthetic data. Automatically falls back to Python-only steps if CLI bioinformatics dependencies are missing. |
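
To make the rules in the table above concrete, here is a hypothetical Python parser that applies them. It is not the code behind `run.sh` or `attila.py`: the default chain type, the use of `ATTILA_STEPS` as a fallback when `--steps` is absent, and the returned dictionary layout are all assumptions, and options such as `--all` and `--interactive` are left out.

```python
import argparse
import os

# VH maps to libtype 0 and VL to libtype 1, as listed in the table above.
LIBTYPE = {"vh": 0, "vl": 1}


def parse_args(argv=None, env=None):
    """Parse ATTILA-style options into a plain dictionary (illustrative only)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Hypothetical ATTILA-style options")
    parser.add_argument("--config", help="path to the project .cfg file")
    parser.add_argument("--steps", default=env.get("ATTILA_STEPS"),
                        help="comma-separated sub-steps, e.g. join,filter,translate")
    parser.add_argument("--type", dest="chain", choices=["vh", "vl", "both"],
                        default="vh")  # default is an assumption
    parser.add_argument("--example", action="store_true")
    args = parser.parse_args(argv)

    # Rule: outside demo mode, the configuration file must be given and exist.
    if not args.example:
        if not args.config:
            parser.error("--config is required unless --example is used")
        if not os.path.isfile(args.config):
            parser.error(f"configuration file not found: {args.config}")

    steps = None
    if args.steps:
        steps = [s.strip() for s in args.steps.split(",") if s.strip()]

    chains = ["vh", "vl"] if args.chain == "both" else [args.chain]
    return {
        "config": args.config,
        "steps": steps,  # None means "run every step"
        "libtypes": [LIBTYPE[c] for c in chains],
        "example": args.example,
    }


if __name__ == "__main__":
    print(parse_args(["--example", "--type", "both", "--steps", "join,filter"], env={}))
```
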
- Python 3.8+
- CLI Dependencies: FastQC, Prinseq-lite, Fastq-join, and IgBlast (optional).
ATTILA includes a cross-platform installer that detects your OS and installs all dependencies automatically:

    git clone https://github.com/waldeyr/attila.git
    cd attila
    chmod +x install.sh run.sh
    ./install.sh

Supported platforms:

| Platform | Package manager | Notes |
|---|---|---|
| macOS | Homebrew (`brew`) | `fastq-join` compiled from source (ea-utils) |
| Debian / Ubuntu | `apt` | All tools available as packages |
| RHEL / Fedora / CentOS / Rocky / AlmaLinux | `dnf` / `yum` | Enables EPEL automatically |

IgBlast (germline step) is optional. The installer downloads and installs it automatically, but the pipeline runs without it — the germline classification step is simply skipped.
If you prefer to install dependencies yourself:
- Clone the ATTILA repository:

      git clone https://github.com/waldeyr/attila.git
      cd attila

- Install Python dependencies:

      pip install -r requirements.txt

- Install CLI tools:
  - macOS: `brew install fastqc` (fastq-join must be compiled from source; see `install.sh`)
  - Debian/Ubuntu: `sudo apt install fastqc ea-utils perl`
  - RHEL/Fedora/Rocky: `sudo dnf install epel-release && sudo dnf install fastqc ea-utils perl`
  - prinseq-lite (all platforms): download the Perl script from SourceForge and place it in your PATH.
- Set execution permissions:

      chmod +x run.sh

Run the full pipeline using a configuration file:

    ./run.sh --config my_project_VH.cfg --all

To run only specific steps:

    ./run.sh --config my_project_VH.cfg --steps filter,translate

To start the interactive configuration wizard:

    ./run.sh --interactive

To run the pipeline with synthetic demo data to verify the installation:

    ./run.sh --example

Note: If `fastqc` or `prinseq-lite` is missing from your PATH, the wrapper will automatically switch to a fallback mode, copying pre-filtered synthetic FASTA files and running all Python-only processing stages. This ensures you can verify and explore the pipeline output without installing additional command-line tools. Results are saved to `example_output/example_project_results/`.
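
The fallback decision described in the note can be sketched in a few lines of Python using `shutil.which`. The actual check lives in the `run.sh` wrapper and may work differently; the function name, the returned tuple, and the exact executable names looked up (for instance, prinseq-lite is often installed as `prinseq-lite.pl`) are assumptions.

```python
import shutil

# Tools whose absence triggers the fallback, per the note above.
# The executable names are assumed; adjust them to your installation.
FILTER_TOOLS = ("fastqc", "prinseq-lite")


def choose_example_mode(tools=FILTER_TOOLS, which=shutil.which):
    """Return ("full", []) if every tool is on PATH, else ("fallback", missing).

    `which` is injectable so the decision can be tested without touching PATH.
    """
    missing = [tool for tool in tools if which(tool) is None]
    return ("fallback", missing) if missing else ("full", [])


if __name__ == "__main__":
    mode, missing = choose_example_mode()
    if mode == "fallback":
        print("Missing tools:", ", ".join(missing))
        print("Running Python-only stages on pre-filtered synthetic FASTA files.")
    else:
        print("All filter tools found; running the complete example pipeline.")
```
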
run.sh automatically runs installation pre-flight checks before every execution. To run the checks manually:

    python3 programs/test_attila.py TestInstallation -v

To run the full test suite (unit + integration):

    python3 programs/test_attila.py -v

| Test class | What it verifies |
|---|---|
| `TestInstallation` | Python version, pip packages, CLI tools in PATH, required data files |
| `TestAttilaPipeline` | DNA functions, 6-frame translation, Z-test statistics, full VH+VL pipeline end-to-end |

Docker containerizes all bioinformatics dependencies (FastQC, prinseq-lite, fastq-join, IgBlast, and Python requirements), removing the need for manual local installation. The image is based on python:3.13-slim-bookworm and the build validates all tools automatically via TestInstallation.
From the repository root directory, run:

    docker build -t attila:latest .

The build runs `python3 programs/test_attila.py TestInstallation -v` as a final step. If any required tool is missing from the image, the build fails with a clear error.

Mount your working directory with -v so that input files and results are accessible on your host machine.

    docker run --rm -v "$(pwd):/app/shared" attila:latest ./run.sh --example

Results are saved inside the container at /app/example_output/ and mirrored to $(pwd)/example_output/ on your host.

To start the interactive configuration wizard:

    docker run -it --rm -v "$(pwd):/app/shared" attila:latest ./run.sh --interactive

Save your project inside /app/shared so that results persist on your host.

To run the pipeline with your own configuration file:

    docker run --rm -v "$(pwd):/app/shared" attila:latest \
      ./run.sh --config /app/shared/my_project_VH.cfg --type both

To run the full test suite:

    docker run --rm attila:latest python3 programs/test_attila.py -v

To open an interactive shell in the container:

    docker run -it --rm -v "$(pwd):/app/shared" attila:latest bash