AniAnn's: ANI Inferred ANNotation of Tandem Repeats
AniAnn's is an a priori satellite array detection and annotation software package. AniAnn's uses a matrix of Average Nucleotide Identity (ANI) values, similar to ModDotPlot, to infer the location and orientation of satellite arrays. It then applies new downstream analysis to accurately annotate their contents and boundaries.
You can download the current release from GitHub by using:
git clone https://github.com/marbl/anianns.git
cd anianns
Although optional, we recommend setting up a virtual environment:
python -m venv venv
source venv/bin/activate
Once the virtual environment is activated, you can install the required dependencies:
python -m pip install .
By default, AniAnn's installs without ModDotPlot as a dependency. If you would like to include ModDotPlot into the same venv for plotting, you can install using the following command:
python -m pip install ".[moddotplot]"
AniAnn's is also available to install using PyPI:
pip install anianns
Once installed, confirm AniAnn's was installed correctly by running python -m anianns -h, or simply with the shortcut anianns -h:
__ __
_ _ _ .' `'._.'` '.
/ \ _ __ (_) / \ _ __ _ __ ' ___ | .--; ;--. |
/ _ \ | '_ \| | / _ \ | '_ \| '_ \ / __| | ( / \ ) |
/ ___ \| | | | | / ___ \| | | | | | | \__ | \ ;` /^\ `; /
/_/ \_\_| |_|_|/_/ \_\_| |_|_| |_| |___/ :` .'._.'. `;
'-`'.___.'`-'
usage: anianns [-h] {annotate,build_db} ...
Ani Ann's: ANI Inferred ANNotation of Tandem Repeats
positional arguments:
{annotate,build_db} Choose mode: annotate or build_db
annotate Takes an input fasta and outputs an annotated bedfile of satellite arrays.
build_db Takes an input fasta, a bedfile of satellite coordinates, and an optional config file and outputs a db of satellite k-mers.
options:
-h, --help show this help message and exit
Note that AniAnn's might take a while to run the first time you use it. This is because the Python interpreter is compiling the source code to bytecode in the __pycache__ directory. Subsequent runs will use the pre-compiled code and load much faster!
AniAnn's must be run either in annotate mode, or build_db mode.
anianns annotate -f <FASTA_FILENAME(S)> <ARGS>
AniAnn's requires at least one FASTA file as input. It generates one output annotation file (default BED) per sequence contained in the input FASTA file(s). Output annotation files are named based on the sequence identifier from the FASTA header.
Annotation of arrays into known satellite classes requires a satellite k-mer database, supplied with the option --classify <directory>. See creating an annotation database for more information.
-f / --fasta <FILENAME(S)>
Fasta file(s) to input. Multifasta files are accepted.
-s / --seq_id <STR>
Sequence ID(s) to extract (multiple IDs may be given when using a multifasta file). IDs that are not found are ignored. Default: None.
-d / --directory <DIR>
Name of output directory. Default: current working directory.
-o / --output-format <STR>
Output annotation file format. Options are BED, GTF, GFF, CSV, TSV, JSON. Default: BED.
-m / --mask <ALL>
Name or repeat length of satellite arrays to mask. Replaces detected satellites with N's. See Repeat Masking for more info. Default: None.
--soft <BOOL>
Softmask flag. Instead of N's, will force bases lower-case in detected satellite arrays. Must be used with --mask. Default: None.
-c / --classify <DIR>
Directory containing .db or .msh k-mer db files. Required for annotation into known satellites. Default: None.
-t / --threshold <INT>
Confidence threshold. This is the minimum percentage of k-mers in a satellite array that must be contained within the selected k-mer db class for the array to be assigned to it. A lower number will be more sensitive, at the risk of incorrect classifications. Default: 50.
-k / --kmer <INT>
K-mer size to use. This should be large enough that k-mers are specific to a satellite class, but not so large that sensitivity is lost. Default: 21.
-i / --identity <INT>
Minimum sequence identity cutoff threshold when running ModDotPlot. While it is possible to go as low as 50% sequence identity, anything below 80% is not recommended. Default: 86.
-w / --window <INT>
Dotplot window size, or the number of bp contained within each pixel in a plot. The sensitivity of satellite detection depends on this value (i.e., a lower window size is more accurate, at the expense of runtime). Default: 2000.
--band <FLOAT>
Instead of creating a full NxN matrix (where N is sequence size), AniAnn's uses a banded matrix to reduce runtime. The size of the band can be adjusted here (units in megabases). Increasing this amount will improve the detection of off-target satellite arrays, at the expense of runtime. Default: 2.
--identifier <STR>
Name of identifier. Used when no matches to a k-mer db are found, or if --classify is not provided. Default: None.
-p / --plot <BOOL>
Create a self-identity plot of each input sequence, in --band length segments. Default: None.
--verbose <BOOL>
Verbose logging output. Creates a log file at --directory. Default: None.
--quiet <BOOL>
Suppress all logging output. Default: None.
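To make the interplay of --kmer and --threshold more concrete, here is a minimal sketch of a k-mer containment check. It is not AniAnn's actual implementation: the function names and the in-memory `db` dictionary (class name mapped to a set of k-mers) are hypothetical, and details such as canonical k-mers or compressed databases are ignored.

```python
def kmers(seq, k=21):
    """Return the set of distinct k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(array_seq, db, k=21, threshold=50):
    """Pick the class sharing the largest percentage of the array's k-mers.

    db is a hypothetical dict: class name -> set of k-mers.
    Returns (class_name, percentage); class_name is None when the best
    percentage is below the threshold.
    """
    query = kmers(array_seq, k)
    if not query:
        return None, 0.0
    best_name, best_pct = None, 0.0
    for name, class_kmers in db.items():
        pct = 100.0 * len(query & class_kmers) / len(query)
        if pct > best_pct:
            best_name, best_pct = name, pct
    if best_pct >= threshold:
        return best_name, best_pct
    return None, best_pct
```

In this sketch, lowering `threshold` lets arrays with fewer shared k-mers receive a class label, which is the sensitivity trade-off that the --threshold description refers to.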
anianns annotate -f <FASTA_FILENAME(S)> --mask <ARGS>
To use AniAnn's as a tool for masking satellite arrays in a given sequence, pass the -m/--mask argument. AniAnn's will output each sequence into its own masked fasta file in the --directory output folder. By default, running --mask with no parameters will mask everything deemed a satellite array.
- If string parameter(s) are provided (e.g., hSat1 hSat2), AniAnn's will mask arrays that match the provided class. Note this must be used in conjunction with --classify in order to match names. Matches are case-insensitive.
- If integer parameter(s) are provided (e.g., 6 7 42), AniAnn's will mask arrays whose predominant monomer length is equal to that provided.
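The masking rules above can be sketched in a few lines of Python. This is an illustration only, not AniAnn's code: the helper names are hypothetical, and arrays are assumed to be given as 0-based, half-open (start, end) intervals.

```python
def split_mask_args(args):
    """Split --mask values into class names (lower-cased) and monomer lengths."""
    names, lengths = set(), set()
    for arg in args:
        if arg.isdigit():
            lengths.add(int(arg))
        else:
            names.add(arg.lower())
    return names, lengths


def should_mask(array_name, monomer_len, names, lengths):
    """With no filters, mask every array; otherwise match by name or length."""
    if not names and not lengths:
        return True
    return (array_name or "").lower() in names or monomer_len in lengths


def mask_sequence(seq, arrays, soft=False):
    """Mask each (start, end) interval: N's by default, lower-case if soft."""
    out = list(seq)
    for start, end in arrays:
        for i in range(start, min(end, len(out))):
            out[i] = out[i].lower() if soft else "N"
    return "".join(out)
```

Here `soft=True` corresponds to the behaviour described for --soft: bases are kept but lower-cased instead of being replaced with N's.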
anianns build_db -f <FASTA_FILENAME(S)> -b <BED_FILENAME(S)> <ARGS>
Building a database of satellite k-mers is how AniAnn's can match a detected satellite array into a known repeat class. AniAnn's will extract and compress k-mers at the coordinates provided by a bed file. Running this will create a directory at --directory.
-f / --fasta <FILENAME(S)>
Fasta file(s) to input. Multifasta files are accepted.
-b / --bed <FILENAME(S)>
Bed file(s) to input. Must contain at least 4 columns: 1 chrom, 2 start, 3 end, 4 name.
-c / --config <FILENAME>
Name of config file to use. See Sample database for more info. Default: None.
-k / --kmer <INT>
K-mer size. Note that k-mers must be the same length when running anianns annotate in order for classification to work. Default: 21.
-d / --directory <DIR>
Specifies the output directory where results will be saved. If not provided, AniAnn's automatically creates an output folder in the current working directory:
- If the BED file contains a track header, the folder name will be derived from that header.
- If no header is present, the folder name will default to the k-mer length (e.g., k_21). Default: current working directory.
-v / --verbose <BOOL>
Verbose logging output. Will output a log file into --directory. Default: False.
-q / --quiet <BOOL>
Suppress all logging output. Default: False.
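Conceptually, build_db collects the k-mers found at each BED interval and groups them by the name in column 4. The sketch below illustrates that idea under simplifying assumptions: sequences are held in a plain dict keyed by chromosome name, BED fields are whitespace-separated, and no compression is applied. All function names are hypothetical and do not reflect AniAnn's internals.

```python
from collections import defaultdict


def parse_bed(lines):
    """Yield (chrom, start, end, name), skipping track, browser and comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name = line.split()[:4]
        yield chrom, int(start), int(end), name


def collect_kmers(sequences, bed_lines, k=21):
    """Return {name: set of k-mers} for BED intervals (0-based, half-open)."""
    db = defaultdict(set)
    for chrom, start, end, name in parse_bed(bed_lines):
        seq = sequences.get(chrom)
        if seq is None:
            continue  # interval refers to a sequence that was not provided
        region = seq[start:end].upper()
        for i in range(len(region) - k + 1):
            db[name].add(region[i:i + k])
    return dict(db)
```

Because the k-mers are stored as fixed-length strings, a database built with one value of k can only be compared against arrays decomposed with the same k, which is why --kmer must match between build_db and annotate.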
Here, we will create a db of known satellite arrays using the HG002 human genome CenSat annotation track (credit: Hailey Loucks):
wget https://raw.githubusercontent.com/hloucks/CenSatData/refs/heads/main/HG002/v1.1/hg002v1.1.cenSatv2.0.bed
We want to remove any centromere transition regions (labeled ct) from this BED file:
awk 'NR==1 || $4!="ct"' hg002v1.1.cenSatv2.0.bed > hg002v1.1.cenSatv2.0.ctRemoved.bed
Finally, we want to group related satellite arrays into the same class. This is done using a --config file. For this example, we will use the file provided in the config folder of this repo:
head config/sample_config.json
{
"gSat": [
"gSat(GSAT,GSATX)",
"gSat(GSAT)",
"gSat(GSATII,GSATX)",
"gSat(GSATII,TAR1)",
"gSat(GSATII)",
"gSat(GSATX)",
"gSat(TAR1)"
],
This merges the k-mers of all variations of gSat into the same class. Anything not in the config file is not included in the k-mer db. The config file must be in standard JSON format. If a config file is not provided, each unique value in column 4 of the input bed file will become its own unique class.
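The grouping rule described above can be sketched as a simple lookup. This is an assumed illustration rather than AniAnn's implementation: the function names are hypothetical, and the config is taken to be a JSON object mapping each class name to a list of BED column-4 values, as in the excerpt shown.

```python
import json


def build_class_lookup(config_text=None):
    """Invert {class: [bed names]} into {bed name: class}; None means no config."""
    if config_text is None:
        return None
    config = json.loads(config_text)
    return {
        bed_name: cls
        for cls, bed_names in config.items()
        for bed_name in bed_names
    }


def resolve_class(bed_name, lookup):
    """Return the class for a BED column-4 value, or None if it is excluded."""
    if lookup is None:
        return bed_name  # no config: every unique name is its own class
    return lookup.get(bed_name)  # names absent from the config are dropped
```

For example, with a config containing only the gSat entry, "gSat(TAR1)" resolves to "gSat", while a name that is not listed resolves to None and would be left out of the database.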
Creating a k-mer db for HG002 using the provided config file takes around 3 minutes. This results in 16.9 million unique k-mers, compressed down into a 53 MB directory. Note that increasing the k-mer size will increase the directory size, because longer k-mers are more specific and so increase the total number of unique k-mers.
For bug reports or general usage questions, please raise a GitHub issue, or email alex dot sweeten at nih dot gov