AOC is a reproducible, modular Snakemake workflow for ortholog-aware evolutionary analysis of protein-coding genes.
AOC integrates:
- Codon-aware alignment
- Phylogenetic reconstruction
- Recombination detection
- Branch labeling
- HyPhy-based molecular evolution analyses
- Automated summarization and reporting
The workflow is designed for both local execution and HPC environments, and scales across many ortholog datasets via a samples.csv configuration.
While individual HyPhy analyses can be run through DataMonkey or the HyPhy command line, AOC is designed for reproducible large-scale analyses across many genes or datasets. It automates alignment preparation, phylogenetic inference, branch labeling, multiple selection tests, and standardized result aggregation within a single workflow.
- Codon-aware alignments (MACSE2)
- Phylogenetic inference (IQ-TREE)
- Recombination detection (HyPhy GARD)
- Comprehensive selection inference:
  - FEL
  - MEME
  - CFEL
  - RELAX
  - aBSREL
  - BUSTED-S-MH
- Branch labeling workflows
- JSON parsing and summarization
- Automated run manifest generation
- Stable environment installer (install.sh)
AOC/
├── workflow/
│ ├── Snakefile
├── config/
│ └── config.yaml
├── scripts/
├── tests/
├── envs/
│ └── AOC.yaml
├── install.sh
├── run_AOC.sh
├── submit_AOC.slurm
└── README.md
AOC provides a stable installer that:
- Detects micromamba / conda / mamba safely
- Avoids broken mamba installations
- Creates or updates environments
- Falls back to Python-only mode if solver fails
- Performs smoke testing
Clone the repository
git clone https://github.com/aglucaci/AOC.git
cd AOC
Run installation script
bash install.sh AOC envs/AOC.yaml
To force a specific frontend, set FRONTEND_OVERRIDE:
FRONTEND_OVERRIDE=conda bash install.sh AOC envs/AOC.yaml
FRONTEND_OVERRIDE=micromamba bash install.sh AOC envs/AOC.yaml
After installation:
conda activate AOC
Run the automated test workflow to confirm that the environment and setup are working correctly:
bash tests/test_installation.sh
AOC is driven by a samples.csv file.
sample,codon_fasta,sequence_labels_csv
BDNF-8,data/BDNF/BDNF-8.fasta,data/BDNF/BDNF-8.sequence_labels.csv
tiny-no-labels,tests/data/tiny.fasta,
Each row corresponds to one ortholog dataset. The current workflow requires the
sequence_labels_csv column to be present in samples.csv, even when you do
not want to provide branch labels for a sample.
If you do not have foreground/background labels for a sample yet, leave the third column blank:
sample,codon_fasta,sequence_labels_csv
tiny-no-labels,tests/data/tiny.fasta,
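The column requirement above can be checked before launching a run. The following is a minimal sketch, not part of AOC itself: `read_samples` and `REQUIRED_COLUMNS` are hypothetical names, and the only behavior assumed is what this README states (three named columns, with an optionally blank third value).

```python
import csv
import io

# Column names taken from the samples.csv examples in this README.
REQUIRED_COLUMNS = ["sample", "codon_fasta", "sequence_labels_csv"]


def read_samples(handle):
    """Parse a samples.csv stream into a list of dicts, one per dataset."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samples.csv is missing column(s): {missing}")
    rows = []
    for row in reader:
        rows.append({
            "sample": row["sample"].strip(),
            "codon_fasta": row["codon_fasta"].strip(),
            # A blank third column means "no branch labels for this sample".
            "sequence_labels_csv": (row["sequence_labels_csv"] or "").strip() or None,
        })
    return rows


example = io.StringIO(
    "sample,codon_fasta,sequence_labels_csv\n"
    "BDNF-8,data/BDNF/BDNF-8.fasta,data/BDNF/BDNF-8.sequence_labels.csv\n"
    "tiny-no-labels,tests/data/tiny.fasta,\n"
)
samples = read_samples(example)
```

Here `samples[1]["sequence_labels_csv"]` is `None`, which makes the unlabeled case easy to branch on in downstream code.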
If you do provide a sequence_labels_csv file, format it with label
corresponding to the branch label and fasta_sequence_header corresponding to
the FASTA header description:
label,fasta_sequence_header
Test,"NM_001709.5 Homo sapiens brain derived neurotrophic factor (BDNF), transcript variant 4, mRNA"
Test,"NM_001270630.1 Rattus norvegicus brain-derived neurotrophic factor (Bdnf), transcript variant 1, mRNA"
Background,"XM_011226480.3 PREDICTED: Ailuropoda melanoleuca brain derived neurotrophic factor (BDNF), mRNA"
Background,"XM_007497196.2 PREDICTED: Monodelphis domestica brain-derived neurotrophic factor (BDNF), transcript variant X1, mRNA"
Background,"NM_001081787.1 Equus caballus brain derived neurotrophic factor (BDNF), mRNA"
Branches labeled “Test” represent the foreground lineages where a specific evolutionary hypothesis (e.g., adaptive selection) is being evaluated, while “Background” branches represent the remainder of the phylogeny and serve as a reference group against which evolutionary patterns in the Test set are compared.
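Because the labels file is keyed on the full FASTA header description, a mismatch between the two files is an easy mistake to make. The sketch below is a hypothetical pre-flight check, not AOC code: the helper names are invented, and exact string matching between the CSV value and the text after `>` is an assumption about how headers are compared.

```python
import csv
import io


def read_labels(handle):
    """Map each FASTA header description to its branch label."""
    return {row["fasta_sequence_header"]: row["label"] for row in csv.DictReader(handle)}


def fasta_headers(handle):
    """Yield header descriptions (the text after '>') from a FASTA stream."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].strip()


def unlabeled_headers(fasta_handle, labels):
    """Return FASTA headers with no entry in the labels mapping (exact match assumed)."""
    return [h for h in fasta_headers(fasta_handle) if h not in labels]


labels_csv = io.StringIO(
    "label,fasta_sequence_header\n"
    'Test,"NM_001709.5 Homo sapiens brain derived neurotrophic factor (BDNF), transcript variant 4, mRNA"\n'
)
fasta = io.StringIO(
    ">NM_001709.5 Homo sapiens brain derived neurotrophic factor (BDNF), transcript variant 4, mRNA\n"
    "ATGACCATC\n"
    ">NM_001081787.1 Equus caballus brain derived neurotrophic factor (BDNF), mRNA\n"
    "ATGACCATC\n"
)
labels = read_labels(labels_csv)
missing = unlabeled_headers(fasta, labels)
```

Quoting the header in the CSV matters because the descriptions contain commas; the `csv` module handles the quoted field correctly.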
The sample value becomes the output directory name under results/, so it is
usually best to keep it aligned with the input dataset name.
From the repository root directory:
bash run_AOC.sh --samples samples.csv
This example uses a real dataset included in the repository.
cat > samples.csv <<'EOF'
sample,codon_fasta,sequence_labels_csv
BDNF-8,data/BDNF/BDNF-8.fasta,data/BDNF/BDNF-8.sequence_labels.csv
EOF
bash run_AOC.sh --samples samples.csv
If you want to run without branch labels, keep the third column header and leave the value empty:
cat > samples.csv <<'EOF'
sample,codon_fasta,sequence_labels_csv
tiny-no-labels,tests/data/tiny.fasta,
EOF
bash run_AOC.sh --samples samples.csv
For a quick setup check before a full manual run, use:
bash tests/test_installation.sh
On a Slurm cluster, submit the workflow with:
sbatch submit_AOC.slurm
The example submit_AOC.slurm writes Slurm stdout/stderr to AOC_<jobid>.out and AOC_<jobid>.err in the submission directory, so no pre-existing logs/ folder is required.
AOC integrates several widely used codon-based evolutionary models implemented in HyPhy (Hypothesis Testing using Phylogenies) to detect signals of natural selection across protein-coding genes. These approaches operate at different biological scales (site, branch, lineage, and gene) and capture complementary evolutionary signals. Using multiple tests together improves robustness because positive selection can manifest differently depending on the evolutionary scenario.
All models are based on codon substitution frameworks (typically variants of MG94 or related codon models) that estimate the ratio of nonsynonymous to synonymous substitution rates (dN/dS, also called ω).
Interpretation of ω:
- ω > 1 → positive (diversifying) selection
- ω = 1 → neutral evolution
- ω < 1 → purifying selection
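As a small illustration of these three regimes, the sketch below classifies a point estimate of ω from a synonymous rate (alpha) and a nonsynonymous rate (beta). The function name and the `tolerance` parameter are hypothetical, and a point estimate on its own is not evidence of selection; the HyPhy methods attach a likelihood ratio test to each claim.

```python
def classify_omega(alpha, beta, tolerance=1e-9):
    """Label the selective regime implied by omega = beta / alpha.

    alpha: synonymous rate (dS); beta: nonsynonymous rate (dN).
    Returns "undefined" when alpha is not positive, since the ratio cannot be formed.
    """
    if alpha <= 0:
        return "undefined"
    omega = beta / alpha
    if abs(omega - 1.0) <= tolerance:
        return "neutral"
    return "positive" if omega > 1.0 else "purifying"
```

For example, `classify_omega(0.5, 2.0)` gives `"positive"` (ω = 4), while `classify_omega(1.2, 0.1)` gives `"purifying"`.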
Each HyPhy method tests a different hypothesis about how selection acts across sites and lineages.
| Method | Scale | Purpose |
|---|---|---|
| FEL | Site | Pervasive selection |
| MEME | Site | Episodic selection |
| aBSREL | Branch | Adaptive branch selection |
| BUSTED-S-MH | Gene | Gene-wide episodic selection |
| CFEL | Site | Contrast site-level selection |
| RELAX | Lineage | Selection intensity shifts |
FEL (Fixed Effects Likelihood) tests for pervasive selection at individual codon sites. It estimates synonymous and nonsynonymous substitution rates independently for each site using maximum likelihood.
Key characteristics:
- Detects consistent selection across the entire phylogeny
- Identifies sites under persistent positive or negative selection
- Conservative but interpretable site-level estimates
FEL is most appropriate when the selective pressure is expected to be stable across evolutionary time.
Documentation:
https://hyphy.org/methods/selection-methods/#fel
MEME (Mixed Effects Model of Evolution) detects episodic positive selection at individual sites. Unlike FEL, MEME allows selection to occur only on a subset of branches.
Key characteristics:
- Detects transient or lineage-specific adaptive events
- Combines site-level and branch-level modeling
- Powerful for detecting adaptive bursts
MEME is widely used when adaptive events are expected to occur sporadically during evolution.
Documentation:
https://hyphy.org/methods/selection-methods/#meme
aBSREL (adaptive Branch-Site Random Effects Likelihood) identifies branches experiencing episodic diversifying selection across a gene.
Key characteristics:
- Tests each branch independently
- Allows multiple ω rate classes on each branch
- Detects adaptive episodes affecting subsets of sites
This method is useful for identifying specific evolutionary lineages undergoing adaptation.
Documentation:
https://hyphy.org/methods/selection-methods/#absrel
BUSTED-S-MH, an extension of BUSTED (Branch-Site Unrestricted Statistical Test for Episodic Diversification), tests for gene-wide episodic positive selection on a predefined set of branches.
Key characteristics:
- Gene-level hypothesis test
- Determines whether any site on any tested branch experienced positive selection
- Incorporates synonymous rate variation and multi-hit substitutions
BUSTED-type methods are often used as a first-pass test to determine whether a gene contains evidence of episodic adaptation before conducting site-level analyses.
Documentation:
https://hyphy.org/methods/selection-methods/#busted
CFEL (Contrast-FEL) compares site-level selection pressures between predefined groups of branches.
Key characteristics:
- Tests whether site-specific selection differs between the compared branch sets
- Identifies lineage-specific evolutionary constraints or adaptations
- Useful in comparative evolutionary studies
For example, CFEL can test whether a site experiences stronger purifying selection in one clade compared to another.
Documentation:
https://hyphy.org/methods/selection-methods/#cfel
RELAX tests whether selection has intensified or relaxed along specific lineages.
Key characteristics:
- Quantifies shifts in selection strength using parameter k
- k > 1 → intensified selection
- k < 1 → relaxed selection
RELAX is particularly useful for studying evolutionary scenarios such as:
- host shifts
- changes in population size
- functional constraint loss
Documentation:
https://hyphy.org/methods/selection-methods/#relax
Different evolutionary processes leave different statistical signatures. AOC integrates multiple complementary approaches to capture these signals.
| Signal | Method |
|---|---|
| Persistent site-level selection | FEL |
| Episodic site-level adaptation | MEME |
| Adaptive lineages | aBSREL |
| Gene-wide episodic adaptation | BUSTED-S-MH |
| Lineage-specific site differences | CFEL |
| Selection intensity changes | RELAX |
Together, these analyses provide a comprehensive evolutionary profile of protein-coding genes.
AOC produces a structured set of outputs that summarize evolutionary selection analyses performed with HyPhy. While HyPhy generates detailed JSON output files for each method, AOC automatically parses these results into tabular summaries and visualizations that are easier to interpret and use for downstream analysis.
The outputs are organized by sample and partition, allowing users to examine selection signals across gene partitions or alignment segments.
results/
  {sample}/
    selection/
      part1/
        FEL.json
        MEME.json
        ABSREL.json
        BUSTEDS-MH.json
        RELAX.json
        CFEL.json
      part2/
        ...
    tables/
      part1/
        {sample}.part1.AOC.FEL_Results.csv
        {sample}.part1.AOC.MEME_Results.csv
        {sample}.part1.AOC.ABSREL_Results.csv
        {sample}.part1.AOC.BUSTEDS-MH_Results.csv
        {sample}.part1.AOC.RELAX_Results.csv
        {sample}.part1.AOC.CFEL_Results.csv
      {sample}.AOC.merged_FEL_Results.csv
      {sample}.AOC.merged_MEME_Results.csv
      {sample}.AOC.merged_ABSREL_Results.csv
      {sample}.AOC.merged_BUSTEDS-MH_Results.csv
      {sample}.AOC.merged_RELAX_Results.csv
      {sample}.AOC.merged_CFEL_Results.csv
      {sample}.selection_overview.csv
    visualizations/
      FEL.merged.png
      MEME.merged.png
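The naming pattern in this layout is regular enough to generate programmatically, which can help when checking that a run completed. The sketch below is illustrative only: `partition_outputs` is a hypothetical helper, and it assumes that selection/ and tables/ both sit directly under results/{sample}/, which the listing suggests but does not state outright.

```python
from pathlib import PurePosixPath

# Method names as they appear in the output file names listed above.
METHODS = ["FEL", "MEME", "ABSREL", "BUSTEDS-MH", "RELAX", "CFEL"]


def partition_outputs(sample, partition, root="results"):
    """Return the expected JSON and CSV path for each method in one partition."""
    base = PurePosixPath(root) / sample
    part = f"part{partition}"
    return {
        method: {
            "json": str(base / "selection" / part / f"{method}.json"),
            "table": str(base / "tables" / part / f"{sample}.{part}.AOC.{method}_Results.csv"),
        }
        for method in METHODS
    }


expected = partition_outputs("BDNF-8", 1)
```

Under the stated assumption, `expected["FEL"]["json"]` is `results/BDNF-8/selection/part1/FEL.json`.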
Each selection method produces a JSON file containing the full statistical output from HyPhy. These files include:
- likelihood estimates
- substitution rate parameters
- site or branch level statistics
- likelihood ratio test statistics
- p-values and corrected p-values
These JSON files preserve the complete analysis output and can be used for advanced downstream analysis or reproducibility.
HyPhy documentation describing these outputs can be found here:
https://hyphy.org/methods/selection-methods/
Because these JSON files contain nested data structures, AOC automatically converts them into more user-friendly tables.
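To give a sense of what this conversion involves, the sketch below flattens a toy FEL-style result into one record per codon site. It is not AOC's parser. The `MLE` / `headers` / `content` layout mirrors what site-level HyPhy analyses commonly emit, but treat the exact keys as an assumption, since they can vary by method and HyPhy version; the toy values are invented.

```python
import json

# Toy document shaped like a site-level HyPhy result (assumed layout, invented values).
TOY_JSON = """
{
  "MLE": {
    "headers": [["alpha", "Synonymous rate"],
                ["beta", "Nonsynonymous rate"],
                ["p-value", "LRT p-value"]],
    "content": {"0": [[1.0, 0.2, 0.03],
                      [0.5, 1.5, 0.20]]}
  }
}
"""


def mle_rows(result, partition="0"):
    """Flatten result["MLE"] into a list of dicts, one per codon site (1-based)."""
    names = [name for name, _description in result["MLE"]["headers"]]
    rows = result["MLE"]["content"][partition]
    return [dict(zip(names, row), CodonSite=i) for i, row in enumerate(rows, start=1)]


rows = mle_rows(json.loads(TOY_JSON))
```

Each element of `rows` can then be written out with `csv.DictWriter`, which is roughly the shape of the per-partition tables described next.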
For each partition, AOC generates CSV tables summarizing the key statistics from each selection method.
File:
{sample}.partX.AOC.FEL_Results.csv
FEL detects pervasive selection at individual codon sites.
Typical columns include:
| Column | Meaning |
|---|---|
| CodonSite | Codon position in the alignment |
| alpha | Synonymous substitution rate |
| beta | Nonsynonymous substitution rate |
| dN/dS | Ratio of nonsynonymous to synonymous substitutions |
| p-value | Significance test for selection |
| adjusted_p-value | Multiple testing corrected p-value |
Interpretation:
- dN/dS > 1 suggests positive selection
- dN/dS < 1 suggests purifying selection
- Sites with adjusted p-value ≤ 0.10 are typically considered significant.
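The interpretation rules above translate into a short filter. This is a sketch under assumptions: the column names are copied from the table of typical columns and may differ in a given AOC version, the helper name is hypothetical, and the CSV rows are invented for illustration.

```python
import csv
import io


def significant_fel_sites(handle, threshold=0.10, p_column="adjusted_p-value"):
    """Return (codon site, direction) for rows at or below the threshold."""
    hits = []
    for row in csv.DictReader(handle):
        if float(row[p_column]) <= threshold:
            alpha, beta = float(row["alpha"]), float(row["beta"])
            # beta > alpha corresponds to dN/dS > 1.
            direction = "positive" if beta > alpha else "purifying"
            hits.append((int(row["CodonSite"]), direction))
    return hits


toy_table = io.StringIO(
    "CodonSite,alpha,beta,p-value,adjusted_p-value\n"
    "12,0.5,2.0,0.01,0.04\n"
    "40,1.2,0.1,0.02,0.06\n"
    "77,0.9,1.1,0.30,0.45\n"
)
hits = significant_fel_sites(toy_table)
```

With the toy rows, sites 12 and 40 pass the 0.10 cutoff (positive and purifying, respectively) and site 77 does not.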
File:
{sample}.partX.AOC.MEME_Results.csv
MEME detects episodic positive selection at individual codon sites.
Important columns:
| Column | Meaning |
|---|---|
| CodonSite | Codon position |
| alpha | Synonymous substitution rate |
| beta+ | Nonsynonymous rate under selection |
| p-value | Test for episodic selection |
Interpretation:
- Significant p-values indicate sites experiencing positive selection on at least one branch of the phylogeny.
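When many sites are tested, raw p-values are usually adjusted before a cutoff is applied. This README does not say which correction AOC uses; the sketch below shows the Benjamini-Hochberg procedure purely as one common way to control the false discovery rate, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

For these four values the adjusted p-values are approximately 0.040, 0.053, 0.053, and 0.200, so the first three would pass a 0.10 threshold.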
File:
{sample}.partX.AOC.ABSREL_Results.csv
aBSREL identifies branches of the phylogeny experiencing episodic diversifying selection.
Columns include:
| Column | Meaning |
|---|---|
| Branch | Branch name in the phylogenetic tree |
| Corrected P-value | Multiple-testing corrected significance |
| omega_max | Maximum estimated dN/dS rate |
| significant_branch_0.10 | Indicator for branches under selection |
Interpretation:
- Branches with Corrected P-value ≤ 0.10 show evidence of episodic adaptive evolution.
File:
{sample}.partX.AOC.BUSTEDS-MH_Results.csv
BUSTED-S-MH tests for gene-wide episodic positive selection.
Columns include:
| Column | Meaning |
|---|---|
| p_value | Significance of gene-wide selection |
| LRT | Likelihood ratio test statistic |
| tested_branches | Number of foreground branches tested |
Interpretation:
- A p_value at or below the chosen threshold (typically between 0.05 and 0.10) indicates evidence that at least one site on at least one tested branch experienced positive selection.
File:
{sample}.partX.AOC.RELAX_Results.csv
RELAX tests for changes in the intensity of natural selection.
Columns include:
| Column | Meaning |
|---|---|
| k | Selection intensity parameter |
| p_value | Significance test |
| selection_shift | Relaxed or intensified selection |
Interpretation:
- k > 1 → intensified selection
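- k < 1 → relaxed selection
- In either case, the shift is only supported when p_value is significant; otherwise there is no evidence that k differs from 1.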
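Reading k and the p-value together can be expressed as a small decision function. This is a hedged sketch: `interpret_relax` is a hypothetical name, the default threshold of 0.05 is an assumption, and the labels are not necessarily the values AOC writes to the selection_shift column.

```python
def interpret_relax(k, p_value, threshold=0.05):
    """Combine the RELAX k estimate with its test p-value into a verbal call."""
    if p_value > threshold:
        # The test compares against k = 1, so a non-significant result supports no shift.
        return "no significant shift"
    if k > 1:
        return "intensified"
    if k < 1:
        return "relaxed"
    return "no significant shift"
```

For instance, `interpret_relax(0.85, 0.01)` returns `"relaxed"`, whereas `interpret_relax(0.85, 0.40)` returns `"no significant shift"` even though k is below 1.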
File:
{sample}.partX.AOC.CFEL_Results.csv
CFEL compares selection pressures between two groups of branches.
Columns include:
| Column | Meaning |
|---|---|
| CodonSite | Codon position |
| p-value | Statistical test for differential selection |
| significant_site_0.10 | Indicator for significant differences |
Interpretation:
- Significant sites indicate different evolutionary pressures between the compared branch sets.
For each method, AOC combines partition-level tables into a single merged file:
{sample}.AOC.merged_FEL_Results.csv
{sample}.AOC.merged_MEME_Results.csv
{sample}.AOC.merged_ABSREL_Results.csv
{sample}.AOC.merged_BUSTEDS-MH_Results.csv
{sample}.AOC.merged_RELAX_Results.csv
{sample}.AOC.merged_CFEL_Results.csv
These files include a Partition column so results can be compared across partitions.
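Conceptually, the merge stacks the per-partition rows and tags each with its origin. The sketch below illustrates that idea with in-memory tables; it is not AOC's implementation, `merge_partitions` is a hypothetical helper, and the partition labels and rows are invented.

```python
import csv
import io


def merge_partitions(tables):
    """Stack per-partition CSV streams, adding a Partition column to every row.

    tables: dict mapping a partition label (e.g. "part1") to a CSV text stream.
    """
    merged = []
    for partition, handle in sorted(tables.items()):
        for row in csv.DictReader(handle):
            merged.append({"Partition": partition, **row})
    return merged


tables = {
    "part2": io.StringIO("CodonSite,p-value\n5,0.2\n"),
    "part1": io.StringIO("CodonSite,p-value\n1,0.01\n2,0.5\n"),
}
merged = merge_partitions(tables)
```

The result has three rows, ordered by partition label, each carrying both its original columns and the `Partition` tag.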
File:
{sample}.selection_overview.csv
This table provides a concise summary of the selection signals detected across all analyses.
Example:
| sample | partition | method | metric | value |
|---|---|---|---|---|
| BDNF | 1 | FEL | significant_sites_FDR_0.10 | 3 |
| BDNF | 1 | MEME | significant_sites_FDR_0.10 | 1 |
| BDNF | 1 | ABSREL | significant_branches | 2 |
| BDNF | 1 | RELAX | k | 0.85 |
This overview allows users to quickly identify partitions showing strong signals of selection.
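Because the overview is in long format (one metric per row), regrouping it by partition makes comparisons easier. The following sketch assumes the five column names shown in the example table and uses those same example rows as CSV; `overview_by_partition` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def overview_by_partition(handle):
    """Group long-format overview rows into {(sample, partition): {metric: value}}."""
    summary = defaultdict(dict)
    for row in csv.DictReader(handle):
        key = (row["sample"], row["partition"])
        summary[key][f'{row["method"]}.{row["metric"]}'] = float(row["value"])
    return dict(summary)


overview_csv = io.StringIO(
    "sample,partition,method,metric,value\n"
    "BDNF,1,FEL,significant_sites_FDR_0.10,3\n"
    "BDNF,1,MEME,significant_sites_FDR_0.10,1\n"
    "BDNF,1,ABSREL,significant_branches,2\n"
    "BDNF,1,RELAX,k,0.85\n"
)
summary = overview_by_partition(overview_csv)
```

Here `summary[("BDNF", "1")]["RELAX.k"]` is `0.85`, and all four metrics for the partition are available in one dictionary.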
AOC also generates plots summarizing selection signals across the alignment.
Examples include:
FEL.merged.png
MEME.merged.png
These plots typically display:
- codon site position on the x-axis
- significance or selection metrics on the y-axis
They help visually identify clusters of sites under selection.
A typical workflow for interpreting AOC results is:
- Examine selection_overview.csv to identify partitions with strong signals.
- Inspect merged FEL and MEME tables to identify specific codon sites under selection.
- Use ABSREL results to determine which phylogenetic branches experienced adaptive evolution.
- Evaluate BUSTED-S-MH results to determine whether the gene shows gene-wide episodic selection.
- Examine RELAX results to detect shifts in selection intensity.
- Use visualizations to identify patterns of selection across the alignment.
Detailed explanations of each method and its statistical output are available in the HyPhy documentation:
https://hyphy.org/methods/selection-methods/
Users interested in advanced interpretation or methodological details should consult the original HyPhy publications associated with each method.
If you use AOC in your work, please cite:
Lucaci AG, Pond SLK. AOC: Analysis of Orthologous Collections - an application for the characterization of natural selection in protein-coding sequences. ArXiv [Preprint]. 2024 Jun 13:arXiv:2406.09522v1. PMID: 38947939; PMCID: PMC11213150.
License: GPL-3.0