
CGPS MLST/cgMLST profile assignments

Docker Quickstart

This assumes you have already built the mlst/cgmlst images using the standard pipeline.

Search my.fasta against the klebsiella MLST scheme downloaded on 3rd Sept 2024 and save the results in my_output.json.

cat my.fasta | docker run --rm -i {IMAGE_PATH}/cgps-mlst:2024-09-03-klebsiella_1 > my_output.json

For a full list of schemes and their tags view the schemes.json file.

You can get debugging information by passing in the DEBUG environment variable, e.g.:

... | docker run --rm -i -e DEBUG='cgps:*,-cgps:trace*' registry...

The output data also includes more details if you set the DEBUG environment variable. This includes the position of the best match and any other close matches. You can see this without much clutter by setting DEBUG='.'.

Making a release

Full release

  • If the code has changed, create a new code image according to the instructions
  • Create the individual scheme images as defined in the CGPS Typing scripts and save the output file.
  • Run python3 build.py -v [code image version] [scheme file CSV] > latest_schemes.json

Individual species releases

Then follow the instructions as for a full release, but provide a single-line CSV file.

Input CSV file description

scheme shortname,date stamp,scheme image name

This is the format output by the typing-databases build script; an example follows the field list below.

  1. The shortname is as in the schemes.json file.
  2. The date stamp is expected to be ISO format, e.g. 2024-09-03.
  3. The scheme image name is the full name of the Docker image containing the scheme data.
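
For example, a plausible single-line file for the scheme used in the Docker quickstart above:

klebsiella_1,2024-09-03,registry.gitlab.com/cgps/pathogenwatch/analyses/typing-databases:2024-09-03-klebsiella_1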

How it works

This project indexes and searches typing databases which have been downloaded using the CGPS Typing scripts. These scripts download data from a variety of sources and reformat it consistently. The build is stored as a Docker image. Building will take several hours if you are updating all of the databases.

Indexing the typing databases allows typing to run quickly; among other things, it hashes every known allele of each locus.

Genomes are typed by searching for exact matches and by calling BLAST. Exact matches are found by looking for known allele prefixes in the assembly and then comparing the hash of the candidate sequence with a list of known hashes.
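
As an illustration only, here is a minimal sketch of that exact-match idea in JavaScript (the function names, index shapes, and prefix length are hypothetical, not the project's actual values):

const crypto = require("crypto");

// Illustrative prefix length, not the project's actual setting.
const PREFIX_LENGTH = 20;

function hashSequence(seq) {
  return crypto.createHash("sha1").update(seq.toLowerCase()).digest("hex");
}

// prefixIndex: Map from a PREFIX_LENGTH-base string to [{ locus, length }, ...]
// hashIndex: Map from a full-sequence hash to the matching allele record
function findExactHits(contig, prefixIndex, hashIndex) {
  const hits = [];
  for (let i = 0; i + PREFIX_LENGTH <= contig.length; i++) {
    const candidates = prefixIndex.get(contig.slice(i, i + PREFIX_LENGTH)) || [];
    for (const { locus, length } of candidates) {
      const allele = hashIndex.get(hashSequence(contig.slice(i, i + length)));
      if (allele) hits.push({ locus, allele, start: i, end: i + length, exact: true });
    }
  }
  return hits;
}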

BLAST is used to identify novel alleles (i.e. ones which are not included in the database). This is done in a couple of rounds. The first round BLASTs a small number of alleles against the genome to identify areas which might contain alleles. The results of this initial round are compared with the results of the exact matching to identify which (if any) loci might have novel hits.

A second round of BLAST uses a larger number of alleles for each locus, but only for the loci which the previous step showed might have a novel allele.
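
The two-round flow might be sketched like this (all names and shapes here are assumptions; blast stands in for invoking the BLAST binaries and parsing their output):

// Sketch of the two-round search described above (hypothetical data model).
async function findNovelCandidates(genome, schemeIndex, exactHits, blast) {
  // Round 1: BLAST a small number of alleles per locus against the genome.
  const probes = schemeIndex.loci.flatMap((locus) => locus.representativeAlleles);
  const roughHits = await blast(probes, genome);

  // Loci with a rough BLAST hit but no exact match might hold a novel allele.
  const exactLoci = new Set(exactHits.map((hit) => hit.locus));
  const candidateLoci = [...new Set(roughHits.map((hit) => hit.locus))]
    .filter((locus) => !exactLoci.has(locus));

  // Round 2: BLAST a larger allele set, but only for the candidate loci.
  const queries = candidateLoci.flatMap((locus) => schemeIndex.allelesFor(locus));
  return blast(queries, genome);
}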

Each locus can have more than one hit for a given genome (which may be an artifact of the specimen, an assembly error, contamination, etc.). It is important to identify cases where hits from BLAST or exact matching overlap for a given locus; some databases include alleles which are truncations of one another and we want to return the "best" result.

There are two parts to the algorithm for historical reasons. This section describes the core search process once exact hits have been identified. Broadly speaking, the algorithm can be summarised as follows (a code sketch of the binning step follows the list):

  • For each locus, create bins containing exact and inexact hits which overlap by more than 80% on a given contig of the assembly
  • Assess which hit is best for each locus in each bin
  • Report those hits (i.e. normally one per locus, but sometimes multiple)
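
A sketch of the binning step, assuming each hit records its contig and start/end coordinates (field names are illustrative, and "overlap by more than 80%" is read here as 80% of the shorter hit):

// Fraction of the shorter hit covered by the overlap between two hits.
function overlapFraction(a, b) {
  const overlap = Math.min(a.end, b.end) - Math.max(a.start, b.start);
  return overlap / Math.min(a.end - a.start, b.end - b.start);
}

// Place each hit into the first bin containing a hit it overlaps with on the
// same contig, or start a new bin.
function binHits(hits) {
  const bins = [];
  for (const hit of hits) {
    const bin = bins.find((b) =>
      b.some((h) => h.contig === hit.contig && overlapFraction(h, hit) > 0.8)
    );
    if (bin) bin.push(hit);
    else bins.push([hit]);
  }
  return bins;
}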

Best is defined as follows for a given bin of hits for a given locus in the database (see the sketch after this list):

  • If there are any exact hits, return the longest; otherwise:
  • Discard hits which cover less than 80% of the length of the specified allele
  • Find the hit with the greatest percentage identity
  • Discard any hits which are 2% worse than the best percentage identity
  • Return the hit with the most matching bases
  • To break ties, return the hit with the greatest percentage identity
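
One reading of those rules as code (the field names are assumptions, not the project's actual data model):

// Pick the best hit from a single bin for a single locus.
function pickBest(bin) {
  const exact = bin.filter((h) => h.exact);
  if (exact.length) {
    // Longest exact hit wins outright.
    return exact.reduce((a, b) => (b.end - b.start > a.end - a.start ? b : a));
  }
  // Discard hits covering less than 80% of the matched allele's length.
  let hits = bin.filter((h) => (h.end - h.start) / h.alleleLength >= 0.8);
  if (!hits.length) return null;
  // Discard hits more than 2 percentage points below the best identity.
  const bestIdentity = Math.max(...hits.map((h) => h.pident));
  hits = hits.filter((h) => h.pident >= bestIdentity - 2);
  // Most matching bases wins; ties break on percentage identity.
  hits.sort((a, b) => b.matchingBases - a.matchingBases || b.pident - a.pident);
  return hits[0];
}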

Finally, in the second part, all matches are resolved to build a PubMLST-type profile (sketched after this list):

  • One match per locus
  • If more than one exact hit is found then the lowest ST is used.
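
A minimal sketch of that resolution step, assuming a profile table keyed by the joined allele codes (shapes are illustrative):

// `profiles` maps a joined allele-code string to the STs that share it;
// ties resolve to the numerically lowest ST.
function assignSt(bestHits, lociOrder, profiles) {
  const code = lociOrder.map((locus) => bestHits[locus].allele).join("_");
  const sts = profiles.get(code) || [];
  return sts.length ? Math.min(...sts.map(Number)) : null;
}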

Building the images locally

There are three stages to building the containers:

  • Build the database images
  • Build the code image
  • Build the final images

The final image build consists of two stages. The first indexes the schemes while the second creates a compiled image of the code and indexed scheme.

A Docker image for local development

Dockerfile.schemedev allows the building of a development environment for running test code inside.

Schemes can be mounted from images for local testing using, e.g., the following command:

docker run --rm -it -v /home/corin/cgps-gits/pathogenwatch/mlst/db:/db --entrypoint /bin/sh registry.gitlab.com/cgps/pathogenwatch/analyses/typing-databases:2024-09-03-klebsiella_1

Similarly, indexed schemes can be mounted from built indexes (or index stage images).

Running directly

The main analysis script (index.js) supports the following options:

Command Line Options

  • --scheme - Shortname of the MLST scheme to use (overrides SCHEME environment variable)
  • --indexDir - Directory containing the indexed scheme data (overrides INDEX_DIR environment variable)

Environment Variables

  • SCHEME - Shortname of the MLST scheme to use
  • INDEX_DIR - Directory containing the indexed scheme data (default: index_dir)

Usage Examples

Run MLST analysis using environment variables:

export SCHEME=klebsiella_1
export INDEX_DIR=index_dir
cat my_genome.fasta | node index.js

Run MLST analysis with command line options:

cat my_genome.fasta | node index.js --scheme=klebsiella_1 --indexDir=custom_index

Run MLST analysis in Docker:

cat my_genome.fasta | docker run --rm -i cgps-mlst:klebsiella_1

The script reads FASTA input from stdin and outputs JSON results to stdout.

Index the data

After downloading the data, it must be indexed before it can be used to call STs. Indexing converts the data into a consistent format and precomputes values such as allele hashes to enable quick exact matches. The command below indexes the data.

Replace ${SCHEME} with the shortname of the scheme.

DEBUG='cgps:info' npm run index -- --scheme=${SCHEME} --index=index_dir --database=/typing-databases

Command Line Options

The indexing command supports the following options:

Required Options

  • --database, -d - Directory containing the scheme data (required)

Optional

  • --type, -t - Filter schemes by type (e.g., "mlst", "cgmlst")
  • --scheme, -s - Shortname of specific scheme(s) to build (can specify multiple)
  • --index, -i - Directory where the index will be created (default: index_dir)
  • --help, -h - Show help information

Usage Examples

Index all schemes from a database directory:

npm run index -- --database=/path/to/typing-databases

Index only MLST schemes:

npm run index -- --type=mlst --database=/path/to/typing-databases

Index specific schemes:

npm run index -- --scheme=klebsiella_1 --scheme=ecoli_1 --database=/path/to/typing-databases

Index to a custom directory:

npm run index -- --index=my_custom_index --database=/path/to/typing-databases

Full command with all options:

npm run index -- --type=mlst --scheme=klebsiella_1 --index=custom_index --database=/path/to/typing-databases

Sanitiser

The full image includes a Go binary called sanitiser. This pre-processes the FASTAs to ensure they are in a format that can be read reliably by general bioinformatics software. It is not required for general use and is aimed at supporting web servers and other systems that might have to process highly variable third-party FASTAs.

For more information see the sanitise-fasta GitHub repository.

Singularity

NB: Singularity is supported on an "as-is" basis. We welcome contributions and fixes from the community.

There are two ways to run MLST with Singularity:

  1. Build the Singularity image from scratch and create indexed databases to run with it.
  2. Create and convert the final Docker images.

Converting existing Docker images

The individual scheme images can be converted using the following approach.

# Convert the docker image to singularity format - edit the image name as appropriate:
docker run -v /var/run/docker.sock:/var/run/docker.sock -v /home/corin/temp:/output --privileged -t --rm quay.io/singularity/docker2singularity registry.gitlab.com/cgps/pathogenwatch-tasks/{mlst/mlst2/cgmlst/ngmast}:{IMAGE_TAG}

# Then prepare the DB folder:
singularity exec pathogenwatch-mlst-231123-v5.2.0.sif cp -rp /usr/local/mlst/index_dir .

# To run it against a genome, replace `{/local/path/to/my.fasta}` with the full path to the FASTA file:
singularity exec --pwd=/usr/local/mlst --bind {/local/path/to/my.fasta}:/tmp/my.fasta pathogenwatch-mlst-202214121127-v3.2.1.sif sh -c 'cat /tmp/my.fasta | /usr/local/bin/node /usr/local/mlst/index.js'

Build a singularity image from scratch

The individual scheme databases first need to be downloaded using the CGPS typing database downloader.

Build image:

singularity build --fakeroot build/mlst.sif mlst.def

Usage:

# Run indexer
singularity exec --pwd /usr/local/mlst build/mlst.sif npm run index -- --scheme=<scheme_name> --index=<index_dir> --database=<typing_databases_dir>

# Run mlst
singularity run --pwd /usr/local/mlst build/mlst.sif <input_fasta_file> <output_json_file> <scheme_name> <index_dir>
