Galah - Scalable dereplication and MIMAG calculation for metagenome assembled genomes
Documentation can be found at https://wwood.github.io/galah/.
Galah aims to be a more scalable metagenome assembled genome (MAG) dereplication method. That is, it clusters microbial genomes together based on their average nucleotide identity (ANI), and chooses a single member of each cluster as the representative.
Galah also determines MIMAG quality scores for genomes based on their completeness, contamination and the presence of rRNA and tRNA genes.
Galah uses a greedy clustering approach to speed up genome dereplication, relative to e.g. dRep, particularly when there are many closely related genomes (i.e. >95% ANI). Generated cluster representatives have 2 properties. If the ANI threshold was set to 95%, then:
- Each representative is <95% ANI to each other representative.
- All members are >=95% ANI to the representative.
If --run-checkm2 was specified, or CheckM2 / CheckM genome qualities were provided, then the clusters have an additional property:
- Each representative genome has a better quality score than other members of the cluster. Each genome is assigned a quality score based on the formula completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000, which is a reduced form of a quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8. Other quality score formulas are available via --quality-formula.
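As an illustration only, the default quality formula above can be written out as a small Python function. This is a sketch rather than Galah's own code: the function name `quality_score` is hypothetical, and it assumes completeness and contamination are given as percentages (0 to 100), as CheckM-style tools usually report them.

```python
def quality_score(completeness, contamination, num_contigs, num_ambiguous_bases):
    """Sketch of the default formula quoted above (hypothetical helper).

    Assumes completeness and contamination are percentages (0-100).
    """
    return (
        completeness
        - 5 * contamination
        - 5 * num_contigs / 100
        - 5 * num_ambiguous_bases / 100000
    )


# A 95% complete, 2% contaminated genome in 100 contigs with no ambiguous bases:
# 95 - 5*2 - 5*100/100 - 0 = 80
print(quality_score(95.0, 2.0, 100, 0))  # 80.0
```

Under this formula, one percentage point of contamination costs as much as five points of completeness, 100 extra contigs, or 100,000 ambiguous bases.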
If CheckM1/2 qualities are not available, then the following holds instead:
- Each representative genome was specified to Galah before other members of the cluster.
The overall greedy clustering approach was largely inspired by the work of Donovan Parks, as described in Parks et al. 2020. It operates in 2 steps. In the first step, a genome is assigned as a representative if no genome of higher quality is >=95% ANI to it. In the second step, each non-representative genome is assigned to the representative genome with which it has the highest ANI.
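The greedy procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Galah's implementation: the names `make_ani_lookup` and `greedy_cluster` are hypothetical, ANI values are supplied up front as a table rather than computed, and the first step is read as comparing each genome against the representatives already chosen (all of which have higher quality).

```python
def make_ani_lookup(pairs):
    """Build a symmetric ANI lookup from {(a, b): ani} (hypothetical helper).

    Pairs that are missing from the table are treated as 0% ANI.
    """
    table = {frozenset(pair): value for pair, value in pairs.items()}

    def ani(a, b):
        return 100.0 if a == b else table.get(frozenset((a, b)), 0.0)

    return ani


def greedy_cluster(genomes, quality, ani, threshold=95.0):
    """Return {representative: [members]} using the greedy approach sketched above."""
    # Visit genomes from highest to lowest quality. sorted() is stable, so
    # genomes with equal quality keep the order in which they were given.
    ordered = sorted(genomes, key=lambda g: quality[g], reverse=True)

    # Step 1: a genome becomes a representative if it is below the ANI
    # threshold to every representative chosen so far.
    representatives = []
    for genome in ordered:
        if all(ani(genome, rep) < threshold for rep in representatives):
            representatives.append(genome)

    # Step 2: every other genome joins the representative it is closest to.
    clusters = {rep: [rep] for rep in representatives}
    chosen = set(representatives)
    for genome in ordered:
        if genome not in chosen:
            best = max(representatives, key=lambda rep: ani(genome, rep))
            clusters[best].append(genome)
    return clusters


# Made-up example data: four genomes with invented quality scores and ANI values.
quality = {"g1": 90.0, "g2": 80.0, "g3": 70.0, "g4": 60.0}
ani = make_ani_lookup({
    ("g1", "g2"): 98.0, ("g1", "g3"): 80.0, ("g1", "g4"): 96.0,
    ("g2", "g3"): 85.0, ("g2", "g4"): 95.5, ("g3", "g4"): 97.0,
})
print(greedy_cluster(["g1", "g2", "g3", "g4"], quality, ani))
# {'g1': ['g1', 'g2'], 'g3': ['g3', 'g4']}
```

In this toy data, g4 is within 95% ANI of both representatives but joins g3 because that is its closest representative, which matches the second step. When no quality scores are available, passing the same score for every genome makes the input order decide which genomes become representatives.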
For clustering a set of genomes at 95% ANI:
galah cluster --genome-fasta-files /path/to/genome1.fna /path/to/genome2.fna \
--output-cluster-definition clusters.tsv
For clustering a set of contigs at 95% ANI:
galah cluster --cluster-contigs --small-genomes --genome-fasta-files /path/to/contigs.fna \
--output-cluster-definition clusters.tsv
For determining MIMAG quality scores for a set of genomes with CheckM2:
galah analyse --genome-fasta-files /path/to/genome1.fna /path/to/genome2.fna \
--output-mimag-summary mimag.tsv
For clustering and determining MIMAG quality scores:
galah process --genome-fasta-files /path/to/genome1.fna /path/to/genome2.fna \
--output-cluster-definition clusters.tsv --output-mimag-summary mimag.tsv
If you have any questions or need help, please open an issue.
Galah is developed by the Woodcroft lab at the Centre for Microbiome Research, School of Biomedical Sciences, QUT, with contributions from Samuel Aroney, Antônio Camargo, and Rhys Newell. It is licensed under GPL3 or later.
The source code is available at https://github.com/wwood/galah.
Aroney, S.T.N., Camargo, A.P., Tyson, G.W. and Woodcroft, B.J. Galah: More scalable dereplication for metagenome assembled genomes. Zenodo (2024). https://doi.org/10.5281/zenodo.13637856