A toolset for handling sequencing data with unique molecular identifiers (UMIs)
This toolset requires Python 3.
To install umitools, run
pip3 install umitools # add --user if you want to install it to your own directory

To process UMI small RNA-seq data, download the example file and run reformat_sra_fastq:
wget -O clipped.fq.gz "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.sRNA-seq.fq.gz"
umitools reformat_sra_fastq -i clipped.fq.gz -o sra.umi.fq -d sra.dup.fq

To process UMI RNA-seq data:
1. Download the example paired-end files and reformat them:
wget -O "r1.fq.gz" "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.r1.fq.gz"
wget -O "r2.fq.gz" "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.r2.fq.gz"
umitools reformat_fastq -l r1.fq.gz -r r2.fq.gz -L r1.fmt.fq.gz -R r2.fmt.fq.gz
This also prints some statistics for your UMI RNA-seq data.
2. Then you can use your favorite RNA-seq aligner (e.g. STAR) to map these reads to the genome and get a BAM/SAM file (e.g., fmt.bam).
To download an example, run
wget -O fmt.bam https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.sorted.bam
To mark the reads with PCR duplicates (and assuming you want to use 8 threads), run
umitools mark_duplicates -f fmt.bam -p 8
This produces fmt.deumi.sorted.bam, in which reads identified as PCR duplicates have the flag 0x400 set. If your downstream analysis (e.g., Picard) takes this flag into account, you are good to go. Otherwise, you can remove the PCR duplicates yourself:
samtools view -b -h -F 0x400 fmt.deumi.sorted.bam > fmt.deumi.F400.sorted.bam
You can then feed the BAM file without PCR duplicates to your downstream analysis.
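
To make the flag concrete: 0x400 (decimal 1024) is the bit that the SAM format reserves for PCR or optical duplicates, and samtools view -F 0x400 drops every record that has this bit set. The Python sketch below mimics that test on bare integer flags. The function names and the example flag values are made up for illustration and are not part of umitools.

```python
# Sketch of the test behind `samtools view -F 0x400`, applied to bare SAM flags.
DUPLICATE_FLAG = 0x400  # SAM bit for "PCR or optical duplicate" (decimal 1024)


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM flag."""
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(flags):
    """Keep only the flags whose duplicate bit is unset."""
    return [f for f in flags if not is_duplicate(f)]


# Made-up flags: 99 and 147 are a properly paired mate pair; 1123 and 1171
# are the same two flags with the duplicate bit (1024) added.
example_flags = [99, 147, 1123, 1171]
print(drop_duplicates(example_flags))  # [99, 147]
```

A tool that honors the flag can read fmt.deumi.sorted.bam directly; the filtering step only matters for tools that ignore it.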
For UMI RNA-seq, the UMI locator in each read is required to exactly match GGG, TCA, or ATC. You can customize the locator sequence by setting --umi-locator LOCATOR1,LOCATOR2,LOCATOR3,LOCATOR4 when you run umi_reformat_fastq.
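
The exact-match rule can be sketched in a few lines of Python. This is an illustration, not the code that umitools runs: it assumes the locator directly follows a 5-nt UMI at the start of the read (the umi_len offset is an assumption, as are the function names), and it parses locators in the same comma-separated form that --umi-locator takes. The two reads are made up.

```python
DEFAULT_LOCATORS = "GGG,TCA,ATC"  # the default locators named above


def parse_locators(arg: str) -> set:
    """Split a comma-separated locator string into a set of sequences."""
    return {loc.strip().upper() for loc in arg.split(",") if loc.strip()}


def has_valid_locator(read: str, locators: set, umi_len: int = 5) -> bool:
    """Check whether the bases right after the UMI exactly match a locator.

    umi_len is an assumed offset, not a value taken from the umitools source.
    """
    return any(read.startswith(loc, umi_len) for loc in locators)


locators = parse_locators(DEFAULT_LOCATORS)
print(has_valid_locator("ACGTAGGGTTTTCCCC", locators))  # True: GGG follows the 5-nt UMI
print(has_valid_locator("ACGTAGGATTTTCCCC", locators))  # False: GGA is not a locator
```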
For UMI small RNA-seq, the default setting requires that the 5' UMI locator in each read match NNNCGANNNTACNNN or NNNATCNNNAGTNNN, and that the 3' UMI locator match NNNGTCNNNTAGNNN. N positions are not required to match, and at most 1 error is allowed across all non-N positions. You can customize the locator sequences for small RNA-seq by setting --umi-pattern-5 and --umi-pattern-3. You can further tweak the number of errors allowed by changing N_MISMATCH_ALLOWED_IN_UMI_LOCATOR in the script.
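
As a rough illustration of the small RNA-seq rule, the Python sketch below compares a read against the default patterns and ignores N positions. It rests on several assumptions that are not confirmed by the umitools source: the 5' pattern is aligned to the first 15 nt of the read and the 3' pattern to the last 15 nt, an "error" is a mismatch (no insertions or deletions), and the error budget is applied to each locator separately. The function names and the example reads are made up.

```python
N_MISMATCH_ALLOWED = 1  # mirrors the default of at most 1 error described above

PATTERNS_5 = ["NNNCGANNNTACNNN", "NNNATCNNNAGTNNN"]
PATTERN_3 = "NNNGTCNNNTAGNNN"


def count_mismatches(seq: str, pattern: str) -> int:
    """Count mismatches at non-N pattern positions; N accepts any base."""
    return sum(1 for s, p in zip(seq, pattern) if p != "N" and s != p)


def matches(seq: str, pattern: str, max_mismatch: int = N_MISMATCH_ALLOWED) -> bool:
    """True if seq has the pattern's length and stays within the mismatch budget."""
    return len(seq) == len(pattern) and count_mismatches(seq, pattern) <= max_mismatch


def has_proper_umis(read: str) -> bool:
    """Assumed layout: 15-nt 5' locator region, insert, 15-nt 3' locator region."""
    if len(read) < 30:
        return False
    head, tail = read[:15], read[-15:]
    return any(matches(head, p) for p in PATTERNS_5) and matches(tail, PATTERN_3)


insert = "TGAGGTAGTAGGTTGTATAGTT"  # made-up 22-nt insert
tail = "CCCGTCCCCTAGCCC"           # matches PATTERN_3 exactly
print(has_proper_umis("AAACGAAAATACAAA" + insert + tail))  # True: exact match
print(has_proper_umis("AAACGTAAATACAAA" + insert + tail))  # True: one mismatch
print(has_proper_umis("AAACTTAAATACAAA" + insert + tail))  # False: two mismatches
```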
A simple in silico PCR simulator for UMI reads. Run it with -h to see options.
In addition to providing subcommands to umitools (e.g., umitools mark_duplicates), these commands can also be called individually.
- umitools reformat_fastq is equivalent to umi_reformat_fastq.
- umitools mark_duplicates is equivalent to umi_mark_duplicates.
- umitools reformat_sra_fastq is equivalent to umi_reformat_sra_fastq.
There are many tools to remove adapters. This is just one example. To process a fastq (raw.fq.gz) file from your UMI small RNA-seq data, you can first remove the 3' end small RNA-seq adapter. For example, you can use fastx_clipper from the FASTX-Toolkit and the adapter sequence is TGGAATTCTCGGGTGCCAAGG:
zcat raw.fq.gz | fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 48 -c -Q33 2> raw.clipped.log | gzip -c - > clipped.fq.gz
where -l 48 specifies the minimum read length after adapter removal. This ensures that every insert is at least 18 nt long once the UMIs are accounted for (18 nt + 15 nt in the 5' UMI + 15 nt in the 3' UMI).
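
The value passed to -l is plain arithmetic, and the short Python sketch below spells it out so that you can recompute it for a different minimum insert length. The function and constant names are illustrative only.

```python
UMI_5_LEN = 15       # nt taken up by the 5' UMI
UMI_3_LEN = 15       # nt taken up by the 3' UMI
MIN_INSERT_LEN = 18  # shortest small RNA insert to keep


def min_clipped_length(min_insert=MIN_INSERT_LEN, umi5=UMI_5_LEN, umi3=UMI_3_LEN):
    """Minimum read length to keep after adapter removal (the -l value)."""
    return min_insert + umi5 + umi3


print(min_clipped_length())               # 48
print(min_clipped_length(min_insert=20))  # 50
```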
To see which reads have improper UMIs, run
umitools reformat_sra_fastq -i clipped.fq.gz -o sra.umi.fq -d sra.dup.fq --reads-with-improper-umi sra.improper_umi.fq
where sra.umi.fq contains all the non-duplicate reads, sra.dup.fq contains all the duplicates, and sra.improper_umi.fq contains the reads with improper UMIs.
- Grab the version on GitHub:
git clone https://github.com/weng-lab/umitools.git
- Install it in editable mode:
pip3 install -e /path/to/umitools

Fu, Y., Wu, P.-H., Beane, T., Zamore, P.D., and Weng, Z. (2018). Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics 19, 531.
Yu Fu (Yu.Fu {at} umassmed.edu)