micrite-gethuman is a Nextflow pipeline designed to extract high-confidence host (human) reads from clinical sequencing data.
When searching for microbial sequences in human clinical samples, it is helpful to have a "ground truth" subset of human DNA from those same samples. This subset can serve as a baseline against which to compare putative microbial hits, using metrics such as base qualities (PHRED scores), helping to distinguish true biological signals from sequencing noise or artifacts.
This pipeline processes BAM files (paired-end reads aligned to a human reference) and applies multiple filters to ensure only the most reliable host reads are retained.
The pipeline extracts reads that meet the following criteria:
- Primary Alignments Only: Excludes secondary or supplementary alignments.
- Proper Pairs: Both reads must be oriented and spaced as expected by the aligner, and cannot be marked as PCR/optical duplicates.
- Expected Reference Chromosome: Maps specifically to a user-defined set of --hostchroms (e.g., "chr1 chr2") to ensure no decoy contig alignments contaminate the outputs.
- Quality Thresholds: Exceeds a user-specified Mapping Quality (MAPQ).
- Length Thresholds: Exceeds a user-specified minimum Query Length.
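
The criteria above can be sketched as a single per-record check on SAM fields. This is a minimal illustration, not the pipeline's actual implementation: the function name `passes_host_filters` and its parameters are hypothetical, the flag bit values come from the SAM format specification, and treating the MAPQ and length thresholds as inclusive (`>=`) is an assumption. It also checks one record at a time and does not confirm that the mate passes too.

```python
# SAM flag bits, as defined in the SAM format specification.
FLAG_PROPER_PAIR = 0x2
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def passes_host_filters(flag, chrom, mapq, query_length,
                        host_chroms, min_mapq, min_query_length):
    """Return True if one alignment record meets all listed criteria.

    Hypothetical helper for illustration only.
    """
    # Primary alignments only: reject secondary and supplementary records.
    if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
        return False
    # Proper pairs: the aligner must have flagged the pair as proper.
    if not flag & FLAG_PROPER_PAIR:
        return False
    # Reject records marked as PCR/optical duplicates.
    if flag & FLAG_DUPLICATE:
        return False
    # Expected reference chromosome: must be in the user-defined host set.
    if chrom not in host_chroms:
        return False
    # Quality and length thresholds (inclusive comparison is an assumption).
    return mapq >= min_mapq and query_length >= min_query_length


host = {"chr1", "chr2"}
# Flag 99 = paired + proper pair + mate reverse + first in pair.
print(passes_host_filters(99, "chr1", 60, 150, host, 20, 10))
# Flag 1123 = flag 99 plus the duplicate bit (0x400).
print(passes_host_filters(1123, "chr1", 60, 150, host, 20, 10))
```

The first call describes a primary, properly paired, non-duplicate read on chr1 and passes; the second differs only in the duplicate bit and is rejected.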
After this initial filter, the pipeline randomly subsamples the retained reads to the number given by the --nreads argument.
Random subsampling uses the seed 111 by default, but this can be changed using the process directive (task.ext.seed = ).
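
To illustrate why a fixed seed matters, here is a minimal sketch of seeded subsampling. It does not reflect the tool the pipeline actually uses: `subsample_reads` and its arguments are hypothetical, and sampling individual read names (rather than pairs) is an assumption. The point is only that the same seed and input yield the same subset on every run.

```python
import random


def subsample_reads(read_names, nreads, seed=111):
    """Pick nreads names at random, reproducibly (hypothetical helper)."""
    names = list(read_names)
    # If fewer reads are available than requested, keep them all.
    if nreads >= len(names):
        return names
    # A dedicated generator keeps the result independent of global state.
    rng = random.Random(seed)
    return rng.sample(names, nreads)


reads = [f"read{i}" for i in range(100)]
first = subsample_reads(reads, 5)
second = subsample_reads(reads, 5)
print(first == second)  # same seed, same input, same subset
```

Changing the seed would generally select a different subset of the same size.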
nextflow run selkamand/micrite-gethuman -profile docker \
--sampleid testsample \
--hostchroms "chr1 chr2" \
--min_query_length 10 \
--min_mapping_quality 20 \
--nreads 5

To verify the installation and workflow logic, run the built-in test profile:
nextflow run . -profile docker,test
See the test file README for details on what to expect in the test run output.