-
Notifications
You must be signed in to change notification settings - Fork 180
Description
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
Salmon
Describe the bug
Only u (unmapped) reads are labelled in the aux_info/unmapped.txt file, even when the log file indicates that many fragments were discarded because they are best-mapped to decoys.
To Reproduce
This GitHub repository details the motivation and full workflow of my pipeline: https://github.com/greenelab/2022-microberna/
To get the read files, I ran:
rule rnaseq_sample_download:
output:
reads="outputs/rnaseq_fastp/{sra}.fq.gz",
json = "outputs/rnaseq_fastp/{sra}.fastp.json",
html = "outputs/rnaseq_fastp/{sra}.fastp.html"
params: tmp_base = lambda wildcards: "inputs/tmp_raw/" + wildcards.sra
threads: 1
resources:
mem_mb=8000
run:
row = m.loc[m['experiment_accession'] == wildcards.sra]
fastqs = row['fastq_ftp'].values[0]
fastqs = fastqs.split(";")
if len(fastqs) == 1:
# single end data; download and stream directly to fastp for trimming.
fastq = fastqs[0]
shell("mkdir -p inputs/tmp_raw")
if not os.path.exists(params.tmp_base + ".fastq.gz"):
shell("wget -O {params.tmp_base}.fastq.gz ftp://{fastq}")
shell("fastp -i {params.tmp_base}.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")
# check that the file exists, and if it does, remove raw fastq files
if os.path.exists(output.reads):
os.remove(params.tmp_base + ".fastq.gz")
else:
# paired end data; download both files, interleave, and then remove files
fastq_1 = fastqs[0]
fastq_2 = fastqs[1]
shell("mkdir -p inputs/tmp_raw")
if not os.path.exists(params.tmp_base + "_1.fastq.gz"):
shell("wget -O {params.tmp_base}_1.fastq.gz ftp://{fastq_1}")
if not os.path.exists(params.tmp_base + "_2.fastq.gz"):
shell("wget -O {params.tmp_base}_2.fastq.gz ftp://{fastq_2}")
shell("fastp -i {params.tmp_base}_1.fastq.gz -I {params.tmp_base}_2.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")
# check that the file exists, and if it does, remove raw fastq files
if os.path.exists(output.reads):
os.remove(params.tmp_base + "_1.fastq.gz")
os.remove(params.tmp_base + "_2.fastq.gz")
However, I think it would be fine to reproduce from untrimmed files:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR436/001/ERR4363141/ERR4363141.fastq.gz
I generated the transcriptome by annotating all publicly available genomes for my species (Faecalibacterium prausnitzii_C). Using these annotations, I cut out coding domain sequences and nc/r/tRNAs and clustered the sequences at 95% identity. Then, I took the complement of these sequences (all the left over intergenic stuff) and designated these as decoys.
I indexed the transcriptome with:
rule index_transcriptome:
input:
seqs=ancient("outputs/gtdb_genomes_salmon_ref/{gtdb_species}.fa"),
decoys=ancient("outputs/gtdb_genomes_intergenic_comb/{gtdb_species}_clustered_intergenic_seq_names.txt")
output: "outputs/gtdb_genomes_salmon_index/{gtdb_species}/info.json"
threads: 1
params: index_dir = lambda wildcards: "outputs/gtdb_genomes_salmon_index/" + wildcards.gtdb_species
resources:
mem_mb=16000,
tmpdir=TMPDIR
conda: "envs/salmon.yml"
benchmark:"benchmarks/salmon_index_{gtdb_species}.txt"
shell:'''
salmon index -t {input.seqs} -i {params.index_dir} --decoys {input.decoys} -k 31
'''
I've attached my reference transcriptome and my file of decoy names at the bottom of this issue
Details --
- Which version of salmon was used?: 1.6.0
- How was salmon installed (compiled, downloaded executable, through bioconda)?: Miniconda
- Which reference (e.g. transcriptome) was used?: Self-generated
- Which read files were used?: ERX4307280, SRX10245671, SRX3847835
- Which which program options were used?
salmon quant -i {params.index_dir} -l A -r {input.reads} -o {params.out_dir} --validateMappings --writeUnmappedNames
Expected behavior
I expected the reads that were counted in the log file as "discarded because they are best-mapped to decoys" to be labelled in the aux_info/unmapped_names.txt file with d, but all reads were marked as u.
Screenshots
$grep "Number of fragments discarded because they are best-mapped to decoys" */logs/*
ERX4307280_quant/logs/salmon_quant.log:[2022-02-02 15:51:25.854] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 948,850
SRX10245671_quant/logs/salmon_quant.log:[2022-02-02 15:57:10.321] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 961,758
SRX3847835_quant/logs/salmon_quant.log:[2022-02-02 15:33:54.185] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 220,351
Desktop (please complete the following information):
- OS: Linux
- Version:
Linux farm.cse.ucdavis.edu 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
Codename: bionic
Additional context
I intentionally mapped all three libraries as SE, even though two are PE. Because of the presence of polycistronic transcripts in microbes, many paired-end reads would be discordant, which causes counts to look very...odd. See this preprint for more details on that phenomenon.
I'm trying to use the decoys as a first step in identifying reads that map to intergenic sequences, where reads might span two coding domain sequences, or land in the intergenic sequences between two coding domain sequenes. (#52)
Decoy names: s__Faecalibacterium_prausnitzii_C_clustered_intergenic_seq_names.txt.gz
Transcriptome: s__Faecalibacterium_prausnitzii_C.fa.gz