MSATK is a Python and command-line toolkit for profiling multiple sequence alignments. It converts nucleotide, protein, and codon-aware alignments into quality-control summaries, per-sequence and per-site statistics, publication-ready plots, codon usage metrics, RSCU tables, and standalone HTML reports.
MSATK: one command to profile, visualize, and report multiple sequence alignments.
mamba install -c bioconda -c conda-forge msatk
msatk profile alignment.fasta
MSATK automatically detects the input format and molecule type, computes summary statistics, generates plots when plotting dependencies are available, and writes a complete report.
alignment_msatk/
|-- report.html
|-- summary.json
|-- parameters.yaml
|-- msatk.log
|-- qc_warnings.txt
|-- tables/
`-- figures/
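In a scripted pipeline it can help to confirm that the files in this layout were actually produced. The sketch below is not part of MSATK: the helper names `build_profile_command` and `missing_outputs` are hypothetical, and it assumes every entry in the layout is always written. If some entries are conditional (for example `figures/` when plotting dependencies are missing), trim the lists accordingly.

```python
from pathlib import Path

# Names copied from the output layout above; adjust if your run omits any of them.
EXPECTED_FILES = ("report.html", "summary.json", "parameters.yaml", "msatk.log", "qc_warnings.txt")
EXPECTED_DIRS = ("tables", "figures")


def build_profile_command(alignment, outdir):
    """Return the argument list for an `msatk profile` run (hypothetical helper)."""
    return ["msatk", "profile", str(alignment), "--out", str(outdir), "--force"]


def missing_outputs(outdir):
    """List expected files and directories that are absent from `outdir`."""
    root = Path(outdir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

A workflow step could pass the command list to `subprocess.run(..., check=True)` and then fail early if `missing_outputs` returns anything.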
Try the bundled demo:
msatk demo
The recommended bioinformatics-native install path is Bioconda:
mamba install -c bioconda -c conda-forge msatk
or:
conda install -c bioconda -c conda-forge msatk
Before the official Bioconda package is available, create an environment with Conda-managed dependencies and install MSATK from PyPI:
conda env create -f environment.yml
conda activate msatk
After Bioconda packaging is live, use:
conda env create -f environment-bioconda.yml
conda activate msatk
Install from PyPI with pip:
pip install msatk
Install from source:
git clone https://github.com/aglucaci/MSATK
cd MSATK
pip install -e ".[dev,all,docs]"
From the repository root:
pip install -e ".[dev,all,docs]"
Run the bundled demo:
msatk demo --out msatk_demo --force
Profile your own alignment:
msatk profile path/to/alignment.fasta
Write to a specific output directory:
msatk profile path/to/alignment.fasta --out results --force
Run codon-aware or protein-specific modes:
msatk codon path/to/cds_alignment.fasta --out codon_results --force
msatk protein path/to/protein_alignment.faa --out protein_results --force
Run tests and checks:
pytest
ruff check src/msatk tests
ruff format --check src/msatk tests
mkdocs build --strict
If you are running without installing the package, set PYTHONPATH first:
# macOS/Linux
PYTHONPATH=src python -m msatk.cli demo --out msatk_demo --force
# Windows PowerShell
$env:PYTHONPATH="src"; python -m msatk.cli demo --out msatk_demo --force
- One-command alignment profiling
- DNA, RNA, protein, and codon-aware modes
- Common alignment inputs: FASTA/aligned FASTA, A3M, PHYLIP/relaxed PHYLIP, CLUSTAL, Stockholm, NEXUS, MAF, SAM, BAM, and CRAM
- Publication-ready PNG plots with optional plotting dependencies
- CSV, JSON, HTML, and Markdown outputs
- Codon usage, RSCU, GC1/GC2/GC3, and stop codon summaries
- Protein amino-acid, residue-class, and hydrophobicity summaries
- QC warnings with plain-language interpretation
- Python API for notebooks and workflows
- Stable output files for Snakemake, Nextflow, and HPC pipelines
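For readers unfamiliar with the RSCU metric named in the feature list, here is a minimal sketch of the standard definition: a codon's observed count divided by the count it would have if all synonymous codons for that amino acid were used equally. This is an illustration only, not MSATK's implementation. It assumes the standard genetic code, in-frame sequences, and that stop codons are excluded; the function name `rscu` here is a standalone helper, not the `CodonProfiler.rscu()` method.

```python
from collections import Counter, defaultdict

# Standard genetic code in TCAG order (assumption: no alternative translation tables).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def rscu(sequences):
    """Compute RSCU per sense codon from in-frame nucleotide sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            # Gapped or ambiguous codons are not in the table and are skipped.
            if codon in CODON_TABLE:
                counts[codon] += 1

    # Group sense codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid never observed; RSCU is undefined
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values
```

Under this definition a value of 1.0 means no bias, values above 1.0 mark preferred codons, and single-codon amino acids (Met, Trp) always score 1.0 when present.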
Core formats:
- FASTA
- aligned FASTA
- PHYLIP
- relaxed PHYLIP
- NEXUS
- CLUSTAL
- Stockholm
- A3M
- MAF
- SAM
- BAM/CRAM-derived alignment summaries
Sequence types:
- nucleotide alignments
- amino-acid alignments
- codon alignments
- translated CDS alignments
- mixed or auto-detected mode
MSATK automatically infers:
- file format
- molecule type: DNA, RNA, protein, codon, translated CDS, or mixed
- alignment length
- whether sequence lengths are consistent
- whether stop codons exist
- whether frameshift warnings are present
- whether the alignment appears codon-aware
These fields are written to summary.json, tables/alignment_summary.csv, and the HTML report.
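Downstream scripts can read these fields back from summary.json with the standard library. The sketch below assumes only that the file holds a JSON object at the top level. The key names are not documented here, so the code lists whatever keys it finds rather than naming any. `load_summary` and `describe` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_summary(outdir):
    """Load summary.json from an MSATK output directory."""
    path = Path(outdir) / "summary.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def describe(summary):
    """Return one 'key: value' line per top-level field, sorted by key."""
    lines = []
    for key in sorted(summary):
        value = summary[key]
        if isinstance(value, (dict, list)):
            # Nested structures are summarized by type to keep the listing short.
            lines.append(f"{key}: <{type(value).__name__}>")
        else:
            lines.append(f"{key}: {value}")
    return lines
```

For example, `print("\n".join(describe(load_summary("alignment_msatk"))))` gives a quick overview of what a run recorded.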
pip install "msatk[all]"
Useful extras:
pip install "msatk[dataframes]"
pip install "msatk[plots]"
pip install "msatk[embed]"
pip install "msatk[ngs]"
SAM is supported without extra dependencies. BAM and CRAM require pysam; install with pip install "msatk[ngs]" or mamba install -c bioconda pysam.
Developer install:
pip install -e ".[dev,all]"
MSATK is intended to be distributed through PyPI first, then Bioconda:
PyPI first -> local Conda recipe -> Bioconda PR -> CI-tested Conda builds
The local Conda recipe is in:
conda-recipe/meta.yaml
Build and test it locally:
mamba create -n conda-build-env -c conda-forge conda-build boa anaconda-client
conda activate conda-build-env
conda build conda-recipe
mamba create -n test-msatk --use-local msatk
conda activate test-msatk
msatk --help
msatk profile --help
For Bioconda submission after a PyPI release, use:
conda-recipe/meta.bioconda.yaml
Update the PyPI source SHA256, copy it to recipes/msatk/meta.yaml in a fork of bioconda/bioconda-recipes, and open a pull request.
msatk profile alignment.fasta
msatk qc alignment.fasta
msatk codon cds_alignment.fasta
msatk protein protein_alignment.faa
msatk embed alignment.fasta --method pca
msatk report alignment_msatk/
msatk demo
Example terminal output:
MSATK alignment profile complete
Input: alignment.fasta
Detected type: codon
Sequences: 248
Alignment length: 3,642
Gap fraction: 4.8%
Variable sites: 712
Mean pairwise identity: 91.4%
Results written to:
alignment_msatk/
Open report:
alignment_msatk/report.html
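To make the headline numbers in this output concrete, the following sketch computes gap fraction, variable sites, and mean pairwise identity using common textbook definitions. MSATK's exact definitions are not stated here and may differ, for example in how gaps or ambiguity codes are handled. The function names are illustrative, and the code treats `-` and `.` as gaps and assumes equal-length sequences.

```python
from itertools import combinations

GAP_CHARS = "-."  # assumption: both characters denote gaps


def gap_fraction(seqs):
    """Fraction of all alignment cells that are gaps."""
    total = sum(len(s) for s in seqs)
    gaps = sum(s.count(c) for s in seqs for c in GAP_CHARS)
    return gaps / total if total else 0.0


def variable_sites(seqs):
    """Number of columns with more than one distinct non-gap residue."""
    count = 0
    for column in zip(*seqs):
        residues = {c.upper() for c in column if c not in GAP_CHARS}
        if len(residues) > 1:
            count += 1
    return count


def mean_pairwise_identity(seqs):
    """Average identity over all sequence pairs, ignoring columns gapped in either."""
    scores = []
    for a, b in combinations(seqs, 2):
        compared = matches = 0
        for x, y in zip(a, b):
            if x in GAP_CHARS or y in GAP_CHARS:
                continue
            compared += 1
            matches += x.upper() == y.upper()
        if compared:
            scores.append(matches / compared)
    return sum(scores) / len(scores) if scores else 0.0
```

The pairwise loop is quadratic in the number of sequences, so very large alignments are usually subsampled before computing identity this way.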
from msatk import MSATK, CodonProfiler, ProteinProfiler, profile_alignment
profiler = MSATK("alignment.fasta")
summary = profiler.summary()
per_sequence = profiler.per_sequence_stats()
per_site = profiler.per_site_stats()
profiler.plot_gap_profile()
profiler.plot_entropy()
profiler.write_report("report.html")
results = profile_alignment("alignment.fasta", outdir="alignment_msatk")
codon = CodonProfiler("cds_alignment.fasta")
rscu = codon.rscu()
protein = ProteinProfiler("protein_alignment.faa")
aa = protein.amino_acid_composition()
Lowercase alias is also supported:
from msatk import msatk
profiler = msatk("alignment.fasta")
When pandas is installed, table-like API methods return pandas DataFrames. In minimal environments, MSATK falls back to lists of dictionaries.
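Because the return type depends on whether pandas is installed, code that must run in both kinds of environment may want to normalize the result first. The helper below is a hypothetical sketch, not part of the MSATK API. It relies only on `DataFrame.to_dict(orient="records")` from pandas and otherwise assumes an iterable of mappings.

```python
def to_records(table):
    """Return a list of row dictionaries from a DataFrame or a list of dicts."""
    to_dict = getattr(table, "to_dict", None)
    if callable(to_dict):
        # pandas DataFrame path: one dictionary per row.
        return to_dict(orient="records")
    # Fallback path: copy each mapping so callers can modify rows safely.
    return [dict(row) for row in table]
```

With this in place, a call such as `to_records(profiler.per_sequence_stats())` yields the same shape of data whether or not pandas is available.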
Roadmap:
- 0.1: one-command profiling, summary statistics, QC, tables, report
- 0.2: hardened format support, better terminal output, example datasets
- 0.3: codon-aware release with RSCU, GC1/GC2/GC3, stop codon detection
- 0.4: visualization themes, SVG output, richer report customization
- 0.5: embeddings, clustering, outlier detection
- 1.0: stable API/output schema, PyPI/Bioconda, containers, workflow examples