Reads best mapped to decoys not documented in `--writeUnmappedNames` output file

**Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?**
Salmon

**Describe the bug**
Only `u` (unmapped) reads are labelled in the `aux_info/unmapped.txt` file, even when the log file indicates that many fragments were discarded because they are best-mapped to decoys. 

**To Reproduce**

This GitHub repository details the motivation and full workflow of my pipeline: https://github.com/greenelab/2022-microberna/

To get the read files, I ran:
```
rule rnaseq_sample_download:
    output:
        reads="outputs/rnaseq_fastp/{sra}.fq.gz",
        json = "outputs/rnaseq_fastp/{sra}.fastp.json",
        html = "outputs/rnaseq_fastp/{sra}.fastp.html"
    params: tmp_base = lambda wildcards: "inputs/tmp_raw/" + wildcards.sra
    threads: 1
    resources:
        mem_mb=8000
    run:
        row = m.loc[m['experiment_accession'] == wildcards.sra]
        fastqs = row['fastq_ftp'].values[0]
        fastqs = fastqs.split(";")
        if len(fastqs) == 1:
            # single end data; download and stream directly to fastp for trimming.
            fastq = fastqs[0]
            shell("mkdir -p inputs/tmp_raw")
            if not os.path.exists(params.tmp_base + ".fastq.gz"):
                shell("wget -O {params.tmp_base}.fastq.gz ftp://{fastq}")

            shell("fastp -i {params.tmp_base}.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")

            # check that the file exists, and if it does, remove raw fastq files
            if os.path.exists(output.reads):
                os.remove(params.tmp_base + ".fastq.gz")

        else:
            # paired end data; download both files, interleave, and then remove files
            fastq_1 = fastqs[0]
            fastq_2 = fastqs[1]
            shell("mkdir -p inputs/tmp_raw")
            if not os.path.exists(params.tmp_base + "_1.fastq.gz"):
                shell("wget -O {params.tmp_base}_1.fastq.gz ftp://{fastq_1}")

            if not os.path.exists(params.tmp_base + "_2.fastq.gz"):
                shell("wget -O {params.tmp_base}_2.fastq.gz ftp://{fastq_2}")

            shell("fastp -i {params.tmp_base}_1.fastq.gz -I {params.tmp_base}_2.fastq.gz --json {output.json} --html {output.html} -R {wildcards.sra} --stdout | gzip > {output.reads}")

            # check that the file exists, and if it does, remove raw fastq files
            if os.path.exists(output.reads):
                os.remove(params.tmp_base + "_1.fastq.gz")
                os.remove(params.tmp_base + "_2.fastq.gz")
```

However, I think it would be fine to reproduce from untrimmed files:
```
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR436/001/ERR4363141/ERR4363141.fastq.gz
```

I generated the transcriptome by annotating all publicly available genomes for my species (*Faecalibacterium prausnitzii_C*). Using these annotations, I cut out coding domain sequences and nc/r/tRNAs and clustered the sequences at 95% identity. Then, I took the complement of these sequences (all the left over intergenic stuff) and designated these as decoys. 

I indexed the transcriptome with:
```
rule index_transcriptome:
    input:
        seqs=ancient("outputs/gtdb_genomes_salmon_ref/{gtdb_species}.fa"),
        decoys=ancient("outputs/gtdb_genomes_intergenic_comb/{gtdb_species}_clustered_intergenic_seq_names.txt")
    output: "outputs/gtdb_genomes_salmon_index/{gtdb_species}/info.json"
    threads: 1
    params: index_dir = lambda wildcards: "outputs/gtdb_genomes_salmon_index/" + wildcards.gtdb_species
    resources:
        mem_mb=16000,
        tmpdir=TMPDIR
    conda: "envs/salmon.yml"
    benchmark:"benchmarks/salmon_index_{gtdb_species}.txt"
    shell:'''
    salmon index -t {input.seqs} -i {params.index_dir} --decoys {input.decoys} -k 31
    '''
```

I've attached my reference transcriptome and my file of decoy names at the bottom of this issue

Details -- 

  * Which version of salmon was used?: 1.6.0
  * How was salmon installed (compiled, downloaded executable, through bioconda)?: Miniconda
  * Which reference (e.g. transcriptome) was used?: Self-generated
  * Which read files were used?: ERX4307280, SRX10245671, SRX3847835
  * Which which program options were used?

```
salmon quant -i {params.index_dir} -l A -r {input.reads} -o {params.out_dir} --validateMappings --writeUnmappedNames
```

**Expected behavior**
I expected the reads that were counted in the log file as "discarded because they are best-mapped to decoys" to be labelled in the `aux_info/unmapped_names.txt` file with `d`, but all reads were marked as `u`.

**Screenshots**
```
$grep "Number of fragments discarded because they are best-mapped to decoys" */logs/*

ERX4307280_quant/logs/salmon_quant.log:[2022-02-02 15:51:25.854] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 948,850
SRX10245671_quant/logs/salmon_quant.log:[2022-02-02 15:57:10.321] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 961,758
SRX3847835_quant/logs/salmon_quant.log:[2022-02-02 15:33:54.185] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 220,351
```

**Desktop (please complete the following information):**
 - OS: Linux
 - Version: 
```
Linux farm.cse.ucdavis.edu 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```
```
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic
```
**Additional context**
I intentionally mapped all three libraries as SE, even though two are PE. Because of the presence of polycistronic transcripts in microbes, many paired-end reads would be discordant, which causes counts to look very...odd. See [this preprint](https://www.biorxiv.org/content/10.1101/2022.01.24.477642v1) for more details on that phenomenon.

I'm trying to use the decoys as a first step in identifying reads that map to intergenic sequences, where reads might span two coding domain sequences, or land in the intergenic sequences between two coding domain sequenes. (#52)

Decoy names: [s__Faecalibacterium_prausnitzii_C_clustered_intergenic_seq_names.txt.gz](https://github.com/COMBINE-lab/salmon/files/8005978/s__Faecalibacterium_prausnitzii_C_clustered_intergenic_seq_names.txt.gz)
Transcriptome: [s__Faecalibacterium_prausnitzii_C.fa.gz](https://github.com/COMBINE-lab/salmon/files/8005980/s__Faecalibacterium_prausnitzii_C.fa.gz)
 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reads best mapped to decoys not documented in `--writeUnmappedNames` output file #748

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Reads best mapped to decoys not documented in --writeUnmappedNames output file #748

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Reads best mapped to decoys not documented in `--writeUnmappedNames` output file #748