- Docker Quickstart
- Making a release
- How it works
- Building the images locally
- Running directly
- Index the data
- Sanitiser
- Singularity
This assumes you have already built the mlst/cgmlst images using the standard pipeline.
Search my.fasta against the klebsiella MLST scheme downloaded on 3rd Sept 2024 and save the results in
my_output.json.
cat my.fasta | docker run --rm -i {IMAGE_PATH}/cgps-mlst:2024-09-03-klebsiella_1 > my_output.json
For a full list of schemes and their tags, view the schemes.json file.
You can get information for debugging by passing in the DEBUG environment variable, e.g:
... | docker run --rm -i -e DEBUG='cgps:*,-cgps:trace*' registry...
The output data also includes more details if the DEBUG environment variable is set, including
the position of the best match and any other close matches. You can see these details without
much log clutter by setting DEBUG='.'.
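For example, reusing the quickstart command above:
cat my.fasta | docker run --rm -i -e DEBUG='.' {IMAGE_PATH}/cgps-mlst:2024-09-03-klebsiella_1 > my_output.json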
- If the code has changed, create a new code image according to the instructions
- Create the individual scheme images as defined in CGPS Typing scripts and save the output file.
- Run
python3 build.py -v [code image version] [scheme file CSV] > latest_schemes.json
Then follow the instructions as for a full release, but provide only a single-line CSV file (an illustrative example follows the field descriptions below):
scheme shortname,date stamp,scheme image name
This is the format output by the typing-databases build script.
- The shortname is as in the schemes.json file.
- The date stamp is expected to be in ISO format, e.g. 2024-09-03.
- The scheme image name is the name of the scheme image built in the previous step.
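For example, a hypothetical single-line CSV (values illustrative, matching the quickstart scheme above):
klebsiella_1,2024-09-03,registry.gitlab.com/cgps/pathogenwatch/analyses/typing-databases:2024-09-03-klebsiella_1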
This project indexes and searches typing databases which have been downloaded using the CGPS Typing scripts. These scripts download data from a variety of sources and reformat them consistently. The build is stored as a docker image. This will take several hours if you are trying to update all databases.
This project indexes the typing databases so that typing can be run quickly. This includes hashing all known alleles of each locus.
Genomes are typed by searching for exact matches and by calling BLAST. Exact matches are found by looking for prefixes in the assembly and then comparing the hash of the sequence with a list of known hashes.
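As an illustration of the hash lookup (not the project's actual code; the hash algorithm and index shape here are assumptions):

```js
const crypto = require("crypto");

function sha1(sequence) {
  return crypto.createHash("sha1").update(sequence).digest("hex");
}

// Hypothetical index mapping allele-sequence hashes to locus/allele IDs.
// The real index format and hash function may differ.
const knownAlleles = new Map([
  [sha1("ATGACCGTT"), { locus: "gapA", alleleId: 1 }],
]);

function exactMatch(sequence) {
  // No hit means the sequence may be a novel allele; fall back to BLAST.
  return knownAlleles.get(sha1(sequence)) || null;
}
```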
BLAST is used to identify novel alleles (i.e. ones which are not included in the database). This is done in a couple of rounds. The first round BLASTs a small number of alleles against the genome to identify areas which might contain alleles. The results of this initial round are compared with the results of the exact matching to identify which (if any) loci might have novel hits.
A second round of BLAST uses a larger number of alleles for each locus, but only for the loci which the previous step showed might have a novel allele.
Each locus can have more than one hit for a given genome (which may be an artifact of the specimen, an assembly error, contamination, etc.). It is important to identify cases where hits from BLAST or exact matching overlap for a given locus; some databases include alleles which are truncations of one another, and we want to return the "best" result.
There are two parts to the algorithm for historical reasons. This section describes the core search process once exact hits have been identified. Broadly speaking, the algorithm proceeds as follows:
- For each locus, create bins containing exact and inexact hits which overlap by more than 80% on a given contig of the assembly (a sketch of this binning step follows the list)
- Assess which hit is the best for each locus in each bin
- Report those hits (i.e. normally one per locus, but sometimes multiple)
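A minimal JavaScript sketch of the binning step (the hit shape `{ contig, start, end }` is an assumption for illustration, not the project's actual data model):

```js
function overlapFraction(a, b) {
  // Hits on different contigs never overlap.
  if (a.contig !== b.contig) return 0;
  const overlap = Math.min(a.end, b.end) - Math.max(a.start, b.start) + 1;
  const shorter = Math.min(a.end - a.start, b.end - b.start) + 1;
  return overlap > 0 ? overlap / shorter : 0;
}

// Group one locus's hits into bins of overlapping (> 80%) hits.
function binHits(hits) {
  const bins = [];
  for (const hit of hits) {
    const bin = bins.find((b) => b.some((h) => overlapFraction(h, hit) > 0.8));
    if (bin) bin.push(hit);
    else bins.push([hit]);
  }
  return bins;
}
```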
Best is defined as follows for a given bin of hits for a given locus in the database (a sketch of these rules follows the list):
- If there are any exact hits, return the longest; otherwise:
- Discard hits which cover less than 80% of the length of the specified allele
- Find the hit with the greatest percentage identity
- Discard any hits whose percentage identity is more than 2% worse than the best
- Return the hit with the most matching bases
- To break ties, return the hit with the greatest percentage identity
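A minimal sketch of these selection rules in JavaScript (field names such as `isExact`, `length`, `pident`, and `matchingBases` are illustrative assumptions, not the project's actual data model):

```js
function bestHit(bin, alleleLength) {
  // 1. If there are any exact hits, return the longest one.
  const exact = bin.filter((h) => h.isExact);
  if (exact.length) return exact.reduce((a, b) => (b.length > a.length ? b : a));

  // 2. Discard hits covering less than 80% of the specified allele's length.
  let candidates = bin.filter((h) => h.length / alleleLength >= 0.8);
  if (!candidates.length) return null;

  // 3. Keep only hits within 2% of the best percentage identity.
  const best = Math.max(...candidates.map((h) => h.pident));
  candidates = candidates.filter((h) => h.pident >= best - 2);

  // 4. Return the hit with the most matching bases,
  //    breaking ties by the greater percentage identity.
  return candidates.reduce((a, b) =>
    b.matchingBases > a.matchingBases ||
    (b.matchingBases === a.matchingBases && b.pident > a.pident) ? b : a
  );
}
```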
Finally, in the second part, all matches are resolved to build a PubMLST-style profile (a small example follows the list):
- One match per locus
- If more than one exact hit is found then the lowest ST is used.
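For example (illustrative values only):

```js
// Hypothetical STs from multiple exact profile matches; the lowest is reported.
const matchedSTs = [258, 11, 340];
const reportedST = Math.min(...matchedSTs); // → 11
```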
There are three stages to building the containers:
- Build the database images
- Build the code image
- Build the final images
The final image build consists of two stages. The first indexes the schemes while the second creates a compiled image of the code and indexed scheme.
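As an illustrative sketch of that two-stage structure (base images, paths, and build steps here are assumptions, not the project's actual Dockerfile):

```dockerfile
# Stage 1: index the scheme (hypothetical layout).
FROM node:18 AS indexer
COPY . /usr/local/mlst
WORKDIR /usr/local/mlst
# /typing-databases would come from the scheme database image.
RUN npm ci && npm run index -- --scheme=klebsiella_1 --index=index_dir --database=/typing-databases

# Stage 2: bundle the code with the indexed scheme.
FROM node:18-slim
COPY --from=indexer /usr/local/mlst /usr/local/mlst
WORKDIR /usr/local/mlst
ENTRYPOINT ["node", "index.js"]
```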
Dockerfile.schemedev allows the building of a development environment for running test code inside.
Schemes can be mounted from images for testing locally using, for example, the following command:
docker run --rm -it -v /home/corin/cgps-gits/pathogenwatch/mlst/db:/db --entrypoint /bin/sh registry.gitlab.com/cgps/pathogenwatch/analyses/typing-databases:2024-09-03-klebsiella_1
Similarly, indexed schemes can be mounted from built indexes (or index-stage images).
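For example (the dev image tag and local path are illustrative, assuming an image built from Dockerfile.schemedev):
docker run --rm -it -v $PWD/index_dir:/usr/local/mlst/index_dir --entrypoint /bin/sh mlst-schemedev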
The main analysis script (index.js) supports the following options:
- `--scheme` - Shortname of the MLST scheme to use (overrides the SCHEME environment variable)
- `--indexDir` - Directory containing the indexed scheme data (overrides the INDEX_DIR environment variable)
The following environment variables are also supported:
- `SCHEME` - Shortname of the MLST scheme to use
- `INDEX_DIR` - Directory containing the indexed scheme data (default: `index_dir`)
Run MLST analysis using environment variables:
export SCHEME=klebsiella_1
export INDEX_DIR=index_dir
cat my_genome.fasta | node index.js
Run MLST analysis with command-line options:
cat my_genome.fasta | node index.js --scheme=klebsiella_1 --indexDir=custom_index
Run MLST analysis in Docker:
cat my_genome.fasta | docker run --rm -i cgps-mlst:klebsiella_1
The script reads FASTA input from stdin and outputs JSON results to stdout.
The commands below are run to index the data.
After downloading the data, it needs to be indexed before it can be used to call STs. Indexing converts the data into a consistent format and precomputes values such as allele hashes to enable fast exact matching.
Replace ${SCHEME} with the shortname of the scheme.
DEBUG='cgps:info' npm run index -- --scheme=${SCHEME} --index=index_dir --database=/typing-databases
The indexing command supports the following options:
- `--database`, `-d` - Directory containing the scheme data (required)
- `--type`, `-t` - Filter schemes by type (e.g., "mlst", "cgmlst")
- `--scheme`, `-s` - Shortname of specific scheme(s) to build (can be specified multiple times)
- `--index`, `-i` - Directory where the index will be created (default: `index_dir`)
- `--help`, `-h` - Show help information
Index all schemes from a database directory:
npm run index -- --database=/path/to/typing-databases
Index only MLST schemes:
npm run index -- --type=mlst --database=/path/to/typing-databases
Index specific schemes:
npm run index -- --scheme=klebsiella_1 --scheme=ecoli_1 --database=/path/to/typing-databases
Index to a custom directory:
npm run index -- --index=my_custom_index --database=/path/to/typing-databases
Full command with all options:
npm run index -- --type=mlst --scheme=klebsiella_1 --index=custom_index --database=/path/to/typing-databases
The full image includes a Go binary called sanitiser. This pre-processes FASTAs to ensure they are in a format that can be read reliably by general bioinformatics software. It is not required for general use and is aimed at supporting web servers and other systems that have to process highly variable third-party FASTAs.
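Assuming the binary follows the usual stdin/stdout filter convention (check the repository below for the actual interface), usage might look like:
cat raw.fasta | sanitiser > clean.fasta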
For more information see the sanitise-fasta GitHub repository.
NB Singularity is supported on an "as-is" basis. We welcome contributions and fixes from the community.
There are two ways to run MLST with Singularity:
- Build the Singularity image from scratch and create indexed databases to run with it.
- Create and convert the final Docker images.
The individual scheme images can be converted using the following approach.
# Convert the docker image to singularity format - edit the image name as appropriate:
docker run -v /var/run/docker.sock:/var/run/docker.sock -v /home/corin/temp:/output --privileged -t --rm quay.io/singularity/docker2singularity registry.gitlab.com/cgps/pathogenwatch-tasks/{mlst/mlst2/cgmlst/ngmast}:{IMAGE_TAG}
# Then prepare the DB folder:
singularity exec pathogenwatch-mlst-231123-v5.2.0.sif cp -rp /usr/local/mlst/index_dir .
# To run it against a genome, replace `{/local/path/to/my.fasta}` with the full path to the FASTA file:
singularity exec --pwd=/usr/local/mlst --bind {/local/path/to/my.fasta}:/tmp/my.fasta pathogenwatch-mlst-202214121127-v3.2.1.sif sh -c 'cat /tmp/my.fasta | /usr/local/bin/node /usr/local/mlst/index.js'
The individual scheme databases will need to be downloaded first using the CGPS typing database downloader.
Build image:
singularity build --fakeroot build/mlst.sif mlst.def
Usage:
# Run indexer
singularity exec --pwd /usr/local/mlst build/mlst.sif npm run index -- --scheme=<scheme_name> --index=<index_dir> --database=<typing_databases_dir>
# Run mlst
singularity run --pwd /usr/local/mlst build/mlst.sif <input_fasta_file> <output_json_file> <scheme_name> <index_dir>