@SPPearce SPPearce commented Jul 9, 2025

Big set of changes to UMI processing.
Currently the only UMI processing available is consensus read generation with fgbio, which is inefficient for whole genome sequencing.
Often the UMI has already been extracted into the read header (for example by bclconvert), or is already present in the BAM file.
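For readers unfamiliar with UMIs in the read header, here is a minimal Python sketch of what that looks like. It assumes the Illumina-style layout in which the UMI is appended as an eighth colon-separated field of the read name, with dual UMIs joined by `+`. The function name `split_umi_from_read_name` is hypothetical and is not part of this PR.

```python
from typing import Optional, Tuple

# Assumption: Illumina-style names have 7 colon-separated fields
# (instrument:run:flowcell:lane:tile:x:y) and the UMI, when present,
# is appended as an 8th field.
ILLUMINA_NAME_FIELDS = 7
UMI_ALPHABET = set("ACGTN+")


def split_umi_from_read_name(header: str) -> Tuple[str, Optional[str]]:
    """Split a FASTQ header line into (read name without UMI, UMI or None)."""
    # Drop the leading "@" and the comment that follows the first whitespace.
    text = header[1:] if header.startswith("@") else header
    name = text.split()[0]
    fields = name.split(":")
    has_umi = (
        len(fields) == ILLUMINA_NAME_FIELDS + 1
        and fields[-1] != ""
        and set(fields[-1]) <= UMI_ALPHABET
    )
    if has_umi:
        return ":".join(fields[:-1]), fields[-1]
    return name, None


if __name__ == "__main__":
    header = "@A00123:45:HXXXX:1:1101:1000:2000:ACGTAC+TTGCAA 1:N:0:ATCACG"
    print(split_umi_from_read_name(header))
```

Headers without the extra field are returned unchanged with `None` as the UMI, so the same function can be applied to FASTQ files where no UMI was extracted.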

Introduces:

  • fastp for UMI extraction from reads (R1, R2, both, or index reads in the FASTQ header) [test added]
  • Moving UMIs from the read name to the RX tag in BAM/CRAM files, so that MarkDuplicates and Sentieon can access them. This is required if the UMIs were extracted with fastp, or if they were originally provided in the read name. [test added for MarkDuplicates]
  • fgbio consensus now supports UMIs in read names.
  • Sentieon consensus mode (technically this doesn't require UMIs, but while I was fiddling with the config...)
  • Ensures that fgbio merges the lanes together prior to consensus generation (fixes #802, "MarkDuplicates step fails with UMIs from read names and 4 lanes"). I also swapped to samtools rather than samblaster for the merge and sort, as is done in the fastquorum pipeline; this removes samblaster entirely (I couldn't get samblaster to merge the multiple files in one go).
  • Uses the Fulcrum Genomics plugin to validate the read structures; see fastquorum#123, "feat: validate read structures with the nf-fgbio plugin".
  • If using Sentieon dedup from BAM/CRAM files without first aligning, the UMIs will only be used if one of the two parameters above is set or the umi_tag parameter is set.
  • MarkDuplicates automatically uses the RX tag if it is present in the BAM files, but Sentieon does not.
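To illustrate the read-name-to-RX step, here is a rough Python sketch that operates on a single tab-separated SAM record. It is not the pipeline's implementation. It assumes the UMI is the last colon-separated field of the read name and that dual UMIs joined by `+` are written to the tag with a hyphen. The function `copy_umi_to_rx` and its `strip_from_name` option are hypothetical names used only for this example.

```python
UMI_ALPHABET = set("ACGTN+-")
MANDATORY_SAM_FIELDS = 11


def copy_umi_to_rx(sam_line: str, delimiter: str = ":",
                   strip_from_name: bool = False) -> str:
    """Return the SAM record with the read-name UMI copied into an RX:Z tag."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < MANDATORY_SAM_FIELDS:
        raise ValueError("expected at least 11 mandatory SAM fields")
    # Records that already carry an RX tag are returned untouched.
    if any(tag.startswith("RX:Z:") for tag in fields[MANDATORY_SAM_FIELDS:]):
        return "\t".join(fields)
    name, sep, umi = fields[0].rpartition(delimiter)
    # Assumption: the last name field is a UMI only if it consists of bases.
    if not sep or not umi or not set(umi) <= UMI_ALPHABET:
        return "\t".join(fields)
    if strip_from_name:
        fields[0] = name
    # Dual UMIs joined by "+" in the name are hyphen-separated in the tag.
    fields.append("RX:Z:" + umi.replace("+", "-"))
    return "\t".join(fields)


if __name__ == "__main__":
    record = "\t".join([
        "A00:1:FC:1:1101:1000:2000:ACGT+TTGA", "99", "chr1", "100", "60",
        "4M", "=", "200", "104", "ACGT", "FFFF", "RG:Z:s1",
    ])
    print(copy_umi_to_rx(record))
```

Once the tag is present, duplicate-marking tools that read RX can group reads by UMI without parsing read names. A real workflow would use a BAM/CRAM-aware tool rather than text manipulation.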

Note that long-term I think we should just send people to fastquorum if they want consensus reads, rather than implementing it in sarek.

@SPPearce SPPearce marked this pull request as draft July 9, 2025 11:15
@nf-core-bot

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.3.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

github-actions bot commented Jul 9, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit ba0bcec

  • ✅ 225 tests passed
  • ❔ 12 tests were ignored
  • ❗ 7 tests had warnings

❗ Test warnings:

  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • schema_lint - Input mimetype is missing or empty
  • schema_description - No description provided in schema for parameter: markduplicates_pixel_distance
  • schema_description - No description provided in schema for parameter: gatk_pcr_indel_model

Run details

  • nf-core/tools version 3.3.2
  • Run at 2025-09-08 12:49:36

Co-authored-by: Maxime U Garcia <max.u.garcia@gmail.com>
@SPPearce SPPearce marked this pull request as ready for review August 28, 2025 14:14
@maxulysse maxulysse changed the title from "[DRAFT] Changes to UMI processing" to "Changes to UMI processing" Aug 28, 2025
@SPPearce SPPearce merged commit 78c6e3e into dev Sep 8, 2025
102 of 107 checks passed
@maxulysse maxulysse deleted the umi_processing branch September 8, 2025 14:47

6 participants