Skip to content

Host read removal with Bowtie 2#49

Merged
skrakau merged 16 commits into
nf-core:devfrom
skrakau:add_host_removal
Jun 18, 2020
Merged

Host read removal with Bowtie 2#49
skrakau merged 16 commits into
nf-core:devfrom
skrakau:add_host_removal

Conversation

@skrakau

@skrakau skrakau commented May 31, 2020

Copy link
Copy Markdown
Member

Here is a suggestion for host read removal using bowtie2. For the host reference sequence either iGenomes or a user specified Fasta reference file can be used.

I also added a MultiQC section for this to display which fraction of reads maps against the host reference and is thus filtered out using a custom content file as suggested by @ewels and @drpatelh (MultiQC/MultiQC#1199), instead of using the standard Bowtie 2 MultiQC format (with its multiple different mapping categories). Moreover, I separated the MultiQC FastQC for before and after preprocessing.

If using this bowtie2 based strategy would be fine for you, I can add a test.

PR checklist

  • This comment contains a description of changes (with reason)
  • If you've fixed a bug or added code that should be tested, add tests!
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repo
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Make sure your code lints (nf-core lint .).
  • Documentation in docs is updated
  • CHANGELOG.md is updated
  • README.md is updated

Learn more about contributing: https://github.com/nf-core/mag/tree/master/.github/CONTRIBUTING.md

@skrakau skrakau requested review from HadrienG and d4straub May 31, 2020 12:00
@d4straub

d4straub commented Jun 2, 2020

Copy link
Copy Markdown
Collaborator

Bowtie2 host removal is fine.
A test and documentation would be great.

@skrakau

skrakau commented Jun 2, 2020

Copy link
Copy Markdown
Member Author

Ok thanks, will do!

Comment thread main.nf Outdated
Comment on lines +559 to +560
zcat ${reads[0]} | echo "Read pairs before removal: \$((`wc -l`/4))" >>${name}_remove_host.log
zcat ${name}_host_unmapped_1.fastq.gz | echo "Read pairs after removal: \$((`wc -l`/4))" >>${name}_remove_host.log

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just for debugging? I can't see where ${name}_remove_host.log is used elsewhere.

Might be a bit of an expensive operation if so, zcat on a big FastQ file can take quite a while. And I guess we get this info from your bowtie logs MultiQC module anyway?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I originally copied it from the remove_phix process, but it can be removed here.

Comment thread bin/multiqc_to_custom_tsv.py Outdated
Comment thread main.nf

script:
def sensitivity = params.host_removal_verysensitive ? "--very-sensitive" : "--sensitive"
if ( !params.single_end ) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for avoiding to duplicate most of the process code (sin gle end and paired end), you can define an input parameter, like at https://github.com/nf-core/mag/blob/master/main.nf#L672

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but in this case there are multiple parts in the process code affected:

  1. -1 "${reads[0]}" -2 "${reads[1]}"
  2. --un-conc-gz ${name}_host_unmapped_%.fastq.gz \
    --al-conc-gz ${name}_host_mapped_%.fastq.gz \
  3. zcat ${name}_host_mapped_1.fastq.gz | awk '{if(NR%4==1) print substr(\$0, 2)}' > ${name}_host_mapped_1.read_ids.txt
    zcat ${name}_host_mapped_2.fastq.gz | awk '{if(NR%4==1) print substr(\$0, 2)}' > ${name}_host_mapped_2.read_ids.txt

Maybe in this case it doesn't necessarily get cleaner if solved like this?

@skrakau skrakau force-pushed the add_host_removal branch from 071ae48 to 86ee5c2 Compare June 3, 2020 15:40
@skrakau

skrakau commented Jun 3, 2020

Copy link
Copy Markdown
Member Author

OK, I now added a test using the --host_fasta option and some brief documentation.

Currently this host removal only works for short reads. If --host_fasta or --host_genome is specified in combination with --manifest, an error with a corresponding message is returned. Is that OK?

@skrakau skrakau requested a review from HadrienG June 4, 2020 12:46
@skrakau

skrakau commented Jun 5, 2020

Copy link
Copy Markdown
Member Author

In the MAG test.config file temp_dir is set to a folder within outdir. So far I didn't do that for test_host_rm.config. Is this necessary for this pipeline, and if yes why?

@skrakau

skrakau commented Jun 8, 2020

Copy link
Copy Markdown
Member Author

Hi @ewels or @apeltzer, could one of you tell me by any chance what the purpose of the line temp_dir = "./tests/tmp_dir" in the test.config files is (it's also used in the ampliseq pipline, but didn't find any documentation)? Still wondering if it is ok to just omit it in the new conf/test_host_rm.config file for this PR.

@ewels

ewels commented Jun 8, 2020

Copy link
Copy Markdown
Member

Link to line in question:

params.temp_dir = "./tests/tmp_dir"

Nothing to do with me I'm afraid. Looks like it was added by @HadrienG

Phil

@ewels

ewels commented Jun 8, 2020

Copy link
Copy Markdown
Member

(but I agree, I can't see anything obvious that it is doing, and I suspect that it can be removed from both config files)

@skrakau

skrakau commented Jun 8, 2020

Copy link
Copy Markdown
Member Author

OK, thanks @ewels !

@d4straub

d4straub commented Jun 9, 2020

Copy link
Copy Markdown
Collaborator

Currently this host removal only works for short reads. If --host_fasta or --host_genome is specified in combination with --manifest, an error with a corresponding message is returned. Is that OK?

I think that's fine. Using proper settings for ONT qc (e.g. --longreads_length_weight 1) will remove those that have no matches with Illumina reads, so ONT reads can be filtered that way as well. And SPAdes anyway is using ONT reads only for resolving uncertainties in the assembly graph including gap closure and therefore ONT host reads are relatively unlikely to mess up the assembly, I think.

@skrakau

skrakau commented Jun 9, 2020

Copy link
Copy Markdown
Member Author

Ok, will change it then so that the already filtered short reads will be used to filter the long reads.

@skrakau skrakau force-pushed the add_host_removal branch from 560b7c3 to e3db180 Compare June 10, 2020 19:02
@skrakau

skrakau commented Jun 11, 2020

Copy link
Copy Markdown
Member Author

Hi @d4straub,

thanks for your input. The following points were added/changed now:

  • filtlong uses now the already filtered short reads as a reference (to indirectly filter out long host reads), for this I changed the channel files_all_raw to files_long_rawand joined this with the filtered short reads
  • accordingly I also changed the order of the processes in the code just for the sake of clarity: first short reads preprocessing, then long reads preprocessing
  • if host removal is turned on in combination with long reads, warnings are given out to inform the user that long reads are only filtered indirectly, and informing about the effect of --longreads_length_weight (if > 1)
  • added some information about this to the docs

I tested this locally for a very basic example, and with --longreads_length_weight 1 long reads mapping to host sequences were discarded as expected, reads mapping to the 'minigut' sequences not.

  • added missing parameters to the summary

Best,
Sabrina

Comment thread conf/test_host_rm.config
@d4straub

d4straub commented Jun 15, 2020

Copy link
Copy Markdown
Collaborator

If I am not mistaken there are 3 tests now:

  • Illumina data
  • Illumina data + host removal
  • Illumina + ONT data

wouldn't it be good to test also

  • Illumina + ONT data + host removal

My reasoning is that there are now some channels that are not tested otherwise. And further changes breaking processes that use these channels might go undetected when the test is missing. What do you think?

edit: layout

@skrakau

skrakau commented Jun 16, 2020

Copy link
Copy Markdown
Member Author

Ok, I added a corresponding test, but it is more for channel testing, as I didn't add host reads to the long read dataset.

@d4straub d4straub left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks! Looks good!

@skrakau

skrakau commented Jun 18, 2020

Copy link
Copy Markdown
Member Author

I made the saving of the host read ids optional and added some sorting, since the order of the Bowtie 2 results is not reproducible.

@d4straub d4straub left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@d4straub d4straub dismissed HadrienG’s stale review June 18, 2020 14:06

Requested changes were either addressed or reasonably dismissed

@skrakau skrakau merged commit 22464a2 into nf-core:dev Jun 18, 2020
@skrakau skrakau deleted the add_host_removal branch May 31, 2021 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants