requirements:
- Python3.6: Python3 is OK, but I'd like to use fstring in Python3.6, if you're using Python3 but under 3.6, feel free to replace the
fstringstatement withformatstatement. - path.py
- biopython: BioPython is for fastq files processing.
- pysam: standard library for bam file processing.
- fastinterval: genomic interval processing.
it only works with sorted bam file, with UMI at the end of reads id:
M03074:71:000000000-B49DJ:1:2103:19133:13331:TTCGCACGA
M03074:71:000000000-B49DJ:1:2105:4089:17175:TTCGCACGA
for UMI pre-processing, here are some choices:
if your data is from illumina platform, just use bcl2fastq.
Before you doing this, you must make sure the UMI cycles is in either read1 or read2 cycles, you can change Reads section in RunInfo.xml:
for example, if your UMI is in INDEX2, with 10 bases, your RunInfo.xml will looks like:
<Read Number="1" NumCycles="150" IsIndexedRead="N" />
<Read Number="2" NumCycles="8" IsIndexedRead="Y" />
<Read Number="3" NumCycles="10" IsIndexedRead="Y" />
<Read Number="4" NumCycles="150" IsIndexedRead="N" />
since bcl2fastq can not handle the UMI in the index region, you must combine the UMI cycles to reads2 cycles, like:
<Read Number="1" NumCycles="150" IsIndexedRead="N" />
<Read Number="2" NumCycles="8" IsIndexedRead="Y" />
<Read Number="4" NumCycles="160" IsIndexedRead="N" />
now, bcl2fastq will take the UMI cycles for Reads2.
set settings section in your SampleSheet.csv:
[Settings]
Read1StartFromCycle,1
Read1EndWithCycle,150
Read2StartFromCycle,1
Read2EndWithCycle,160
Read2UMIStartFromCycle,1
Read2UMILength,10
TrimUMI,1
make sure your csv file with same number of columns.
fastp is a fast all-in-one tool for preprocessing FastQ files.
It can extract UMI too.
using bwa or other aligner, produce a sorted bam file, for example:
bwa mem -t 8 -M hg19.fasta read1.fastq read2.fastq | \
samtools view -@ 8 -S -b | \
samtools sort -@ 8 -o sample_name.sorted.bam from consensus_maker import ConsensusWorker
worker = ConsensusWorker('sample_name.sorted.bam', 'consensus.1.fastq', 'consensus.2.fastq',
bed_file='target.bed')
worker.output_pe_reads()TODO:
- make a simple script;
- UMI analysis and report;
- finish testing;
- support for
a-bfamily andb-afamily.
welcome to pull a request.