This repository holds the code for running the Bacterial Meningitis Genomic Analysis Platform (BMGAP) pipeline on the command line.
The pipeline processes raw reads from a sequencing run and currently supports only Illumina platforms. It was tested on Sun Grid Engine (SGE) and automatically submits jobs to your cluster. To pass variables from the Conda environment through to those jobs, make sure the -V option is used when submitting jobs.
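As a rough illustration of the -V option, the sketch below builds and prints an SGE submission command rather than running it. It assumes you choose to launch the runner itself as a job, that the script sits at ./BMGAP-RUNNER.sh, and that your site permits -cwd and -b y; the function name build_qsub_cmd and the directory names are hypothetical.

```shell
#!/bin/sh
# Sketch only: prints an SGE submission command for review instead of running it.
# ASSUMPTIONS: runner located at ./BMGAP-RUNNER.sh; directory names are placeholders.

build_qsub_cmd() {
    fastq_dir=$1
    analysis_dir=$2
    # -V    export the submitting shell's environment (including an activated
    #       Conda environment) to the job
    # -cwd  run the job from the current working directory
    # -b y  allow the command to be a binary or a script path
    printf 'qsub -V -cwd -b y ./BMGAP-RUNNER.sh %s %s\n' "$fastq_dir" "$analysis_dir"
}

# Print the command so it can be checked before submitting.
build_qsub_cmd fastqs results
```

Activate the Conda environment before submitting, since -V only exports what is set in the submitting shell.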
Create the environment from the *yml file:
conda env create -f BMGAP_Conda_all.yml
Download and store the human genome:
aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/ ./analysis_scripts/hg38
Build the PubMLST database in the PMGA folder within the analysis_scripts directory (**NOTE:** The database should be updated regularly, based on usage frequency, to ensure the most current PubMLST data):
python build_pubmlst_dbs.py -o pubmlst_dbs_all
Build the RefSeq Mash sketch for BMScan and PMGA:
wget -qO- http://gembox.cbcb.umd.edu/mash/RefSeqSketchesDefaults.msh.gz | gunzip | tee analysis_scripts/SpeciesDB/lib/RefSeqSketchesDefaults.msh > analysis_scripts/PMGA/lib/RefSeqSketchesDefaults.msh
Unzip the MLST file in the locusextractor folder:
gunzip analysis_scripts/locusextractor/settings_antibiotics/lookupTables/Isolate2MLST2Species.txt.gz
Usage:
BMGAP-RUNNER.sh <FASTQ_DIR> <ANALYSIS_DIRECTORY>
arguments:
FASTQ_DIR Input directory: directory of paired-end FASTQ files to analyze
ANALYSIS_DIRECTORY Output directory: directory where results should be placed
---
config:
layout: dagre
theme: redux
look: neo
---
flowchart TB
subgraph Assembly["Assembly"]
direction LR
a1["Human DNA removal (bowtie2)
Adaptor removal
Quality trimming (cutadapt)
De novo assembly (SPAdes)"]
a3["Assembly QC"]
end
subgraph Characterization["Characterization"]
a4["Species identification using BMScan"]
a5["PMGA
(Serogroup/serotype characterization and genome annotation)"]
a6["Locus Extractor
(MLST identification)"]
end
subgraph AMR["AMR"]
b4["AMR related genes identification from gene list"]
b5["AA substitutions identification from known substitutions"]
b6["Resistance prediction (genotype to phenotype)"]
end
a1 --> a3
a4 --> a5
a5 --> a6
A["FASTQ files"] ---> Assembly
Assembly ---> Characterization
Characterization ---> AMR
b4 --> b5
b5 --> b6
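Because every stage in the diagram depends on the reference data built during setup, a quick pre-flight check can save a failed run. The sketch below is not part of BMGAP; the function name check_bmgap_setup is hypothetical, and the location analysis_scripts/PMGA/pubmlst_dbs_all is an assumption based on running build_pubmlst_dbs.py from inside the PMGA folder.

```shell
#!/bin/sh
# Sketch only: verify that the setup artifacts exist under a repository root.
# ASSUMPTION: the PubMLST output directory was created inside analysis_scripts/PMGA.

check_bmgap_setup() {
    root=${1:-.}
    missing=0
    for path in \
        analysis_scripts/hg38 \
        analysis_scripts/PMGA/pubmlst_dbs_all \
        analysis_scripts/SpeciesDB/lib/RefSeqSketchesDefaults.msh \
        analysis_scripts/PMGA/lib/RefSeqSketchesDefaults.msh \
        analysis_scripts/locusextractor/settings_antibiotics/lookupTables/Isolate2MLST2Species.txt
    do
        if [ -e "$root/$path" ]; then
            echo "found:   $path"
        else
            echo "MISSING: $path"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}

# Usage, from the repository root:
#   check_bmgap_setup . && echo "setup looks complete"
```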
We encourage you to send your results to CDC to help build the most robust molecular surveillance system possible.
To submit data to our national molecular surveillance system, run the following from analysis_scripts:
PrepareToShare.sh <Result directory> <Lab_Name>
This will produce a .tgz file. Please attach this compressed file to an email, along with a metadata spreadsheet, and send it to mpdlb_informatics@cdc.gov.
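Before emailing the archive, it may be worth confirming that it can be read. The following is a minimal sketch using standard tar options; the function name archive_is_readable and the file name in the usage comment are placeholders, since the name of the .tgz produced by PrepareToShare.sh is not stated here.

```shell
#!/bin/sh
# Sketch only: check that a .tgz archive can be listed without errors.

archive_is_readable() {
    # -t lists members without extracting; -z handles the gzip layer.
    tar -tzf "$1" > /dev/null 2>&1
}

# Usage (replace the placeholder with the file PrepareToShare.sh produced):
#   archive_is_readable my_results.tgz && tar -tzf my_results.tgz
```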
By providing data back to CDC, you help enrich our surveillance system and support other public health labs, even across jurisdictional lines. You also allow us to place the data in a secure database accessible only to our public health partners.
The CDC database can be accessed with a SAMS account. If you would like access to a SAMS account, please send a request to mpdlb_informatics@cdc.gov [1].
To verify that the pipeline, databases, and dependencies are installed correctly, download SRR8034137 from NCBI. Once downloaded, rename the files from SRR8034137_1.fastq.gz and SRR8034137_2.fastq.gz to SRR8034137_R1.fastq.gz and SRR8034137_R2.fastq.gz. Then run the test isolate and compare the output with the expected results in the test folder.
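Once the files are downloaded, the renaming step described above could be scripted roughly as follows. This is a sketch, not part of the pipeline: the function name rename_sra_pairs is hypothetical. Note that fasterq-dump writes uncompressed .fastq files, so they need to be compressed (for example with gzip) before they match the .fastq.gz names above.

```shell
#!/bin/sh
# Sketch only: rename <name>_1.fastq.gz / <name>_2.fastq.gz to
# <name>_R1.fastq.gz / <name>_R2.fastq.gz inside one directory.

rename_sra_pairs() {
    dir=$1
    for f in "$dir"/*_1.fastq.gz "$dir"/*_2.fastq.gz; do
        # Skip unmatched glob patterns (no such files present).
        [ -e "$f" ] || continue
        base=${f%.fastq.gz}     # strip the extension
        read_no=${base##*_}     # text after the last underscore: 1 or 2
        mv -- "$f" "${base%_*}_R${read_no}.fastq.gz"
    done
}

# Usage, after downloading into test.in:
#   gzip test.in/*.fastq
#   rename_sra_pairs test.in
```

Files that already carry the _R1 or _R2 suffix do not match the patterns and are left untouched.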
Here is one example of how to download SRR8034137 into a folder named test.in:
fasterq-dump SRR8034137 --threads 1 --outdir test.in --split-files --skip-technical
Footnotes
1. Users can access only their own data, depending on their permissions.