Skip to content

Reads best mapped to decoys not documented in --writeUnmappedNames output file #748

@taylorreiter

Description

@taylorreiter

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
Salmon

Describe the bug
Only u (unmapped) reads are labelled in the aux_info/unmapped.txt file, even when the log file indicates that many fragments were discarded because they are best-mapped to decoys.

To Reproduce

This GitHub repository details the motivation and full workflow of my pipeline: https://github.com/greenelab/2022-microberna/

To get the read files, I ran:

rule rnaseq_sample_download:
    output:
        reads="outputs/rnaseq_fastp/{sra}.fq.gz",
        json = "outputs/rnaseq_fastp/{sra}.fastp.json",
        html = "outputs/rnaseq_fastp/{sra}.fastp.html"
    params: tmp_base = lambda wildcards: "inputs/tmp_raw/" + wildcards.sra
    threads: 1
    resources:
        mem_mb=8000
    run:
        row = m.loc[m['experiment_accession'] == wildcards.sra]
        fastqs = row['fastq_ftp'].values[0]
        fastqs = fastqs.split(";")
        if len(fastqs) == 1:
            # single end data; download and stream directly to fastp for trimming.
            fastq = fastqs[0]
            shell("mkdir -p inputs/tmp_raw")
            if not os.path.exists(params.tmp_base + ".fastq.gz"):
                shell("wget -O {params.tmp_base}.fastq.gz ftp://{fastq}")

            shell("fastp -i {params.tmp_base}.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")

            # check that the file exists, and if it does, remove raw fastq files
            if os.path.exists(output.reads):
                os.remove(params.tmp_base + ".fastq.gz")

        else:
            # paired end data; download both files, interleave, and then remove files
            fastq_1 = fastqs[0]
            fastq_2 = fastqs[1]
            shell("mkdir -p inputs/tmp_raw")
            if not os.path.exists(params.tmp_base + "_1.fastq.gz"):
                shell("wget -O {params.tmp_base}_1.fastq.gz ftp://{fastq_1}")

            if not os.path.exists(params.tmp_base + "_2.fastq.gz"):
                shell("wget -O {params.tmp_base}_2.fastq.gz ftp://{fastq_2}")

            shell("fastp -i {params.tmp_base}_1.fastq.gz -I {params.tmp_base}_2.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")

            # check that the file exists, and if it does, remove raw fastq files
            if os.path.exists(output.reads):
                os.remove(params.tmp_base + "_1.fastq.gz")
                os.remove(params.tmp_base + "_2.fastq.gz")

However, I think it would be fine to reproduce from untrimmed files:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR436/001/ERR4363141/ERR4363141.fastq.gz

I generated the transcriptome by annotating all publicly available genomes for my species (Faecalibacterium prausnitzii_C). Using these annotations, I cut out coding domain sequences and nc/r/tRNAs and clustered the sequences at 95% identity. Then, I took the complement of these sequences (all the left over intergenic stuff) and designated these as decoys.

I indexed the transcriptome with:

rule index_transcriptome:
    input:
        seqs=ancient("outputs/gtdb_genomes_salmon_ref/{gtdb_species}.fa"),
        decoys=ancient("outputs/gtdb_genomes_intergenic_comb/{gtdb_species}_clustered_intergenic_seq_names.txt")
    output: "outputs/gtdb_genomes_salmon_index/{gtdb_species}/info.json"
    threads: 1
    params: index_dir = lambda wildcards: "outputs/gtdb_genomes_salmon_index/" + wildcards.gtdb_species
    resources:
        mem_mb=16000,
        tmpdir=TMPDIR
    conda: "envs/salmon.yml"
    benchmark:"benchmarks/salmon_index_{gtdb_species}.txt"
    shell:'''
    salmon index -t {input.seqs} -i {params.index_dir} --decoys {input.decoys} -k 31
    '''

I've attached my reference transcriptome and my file of decoy names at the bottom of this issue

Details --

  • Which version of salmon was used?: 1.6.0
  • How was salmon installed (compiled, downloaded executable, through bioconda)?: Miniconda
  • Which reference (e.g. transcriptome) was used?: Self-generated
  • Which read files were used?: ERX4307280, SRX10245671, SRX3847835
  • Which which program options were used?
salmon quant -i {params.index_dir} -l A -r {input.reads} -o {params.out_dir} --validateMappings --writeUnmappedNames

Expected behavior
I expected the reads that were counted in the log file as "discarded because they are best-mapped to decoys" to be labelled in the aux_info/unmapped_names.txt file with d, but all reads were marked as u.

Screenshots

$grep "Number of fragments discarded because they are best-mapped to decoys" */logs/*

ERX4307280_quant/logs/salmon_quant.log:[2022-02-02 15:51:25.854] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 948,850
SRX10245671_quant/logs/salmon_quant.log:[2022-02-02 15:57:10.321] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 961,758
SRX3847835_quant/logs/salmon_quant.log:[2022-02-02 15:33:54.185] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 220,351

Desktop (please complete the following information):

  • OS: Linux
  • Version:
Linux farm.cse.ucdavis.edu 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic

Additional context
I intentionally mapped all three libraries as SE, even though two are PE. Because of the presence of polycistronic transcripts in microbes, many paired-end reads would be discordant, which causes counts to look very...odd. See this preprint for more details on that phenomenon.

I'm trying to use the decoys as a first step in identifying reads that map to intergenic sequences, where reads might span two coding domain sequences, or land in the intergenic sequences between two coding domain sequenes. (#52)

Decoy names: s__Faecalibacterium_prausnitzii_C_clustered_intergenic_seq_names.txt.gz
Transcriptome: s__Faecalibacterium_prausnitzii_C.fa.gz

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions