Skip to content

feat: StringTie novel transcript discovery — RNA-seq BAMs preferred, user-supplied GTF bypass, intergenic-only filter #164

@pinin4fjords

Description

@pinin4fjords

Summary

Three refinements to PR #158 (StringTie novel transcript discovery):

  1. Accept user-supplied --novel_gtf to bypass assembly entirely
  2. Use RNA-seq BAMs for StringTie when available (Ribo-seq BAMs are a fallback)
  3. Filter output to gffcompare class u (intergenic-only) before passing downstream

Blocked by: PR #158 merged. Part of --extended_orf_analysis opt-in flag.

1. User-supplied GTF (--novel_gtf)

Users with a pre-built cell-type-specific annotation (e.g. from long-read RNA-seq or a published tissue-specific catalogue) can supply it directly via --novel_gtf, skipping StringTie assembly. The supplied GTF is passed through the same gffcompare class u filter and used identically downstream. StringTie only runs when --novel_gtf is absent.

2. Prefer RNA-seq BAMs for StringTie

Every published microprotein workflow assembles from RNA-seq, not Ribo-seq: RIBOSS (Lim et al. 2025), Rp3 (Vieira de Souza et al. 2024), Wu et al. tomato (2019), ggRibo Arabidopsis protocol (PMID: 41241901). Ribo-seq footprints produce shallow, sparse overlap graphs in StringTie with non-uniform P-site-biased coverage, leading to fragmented assemblies and false-positive single-exon "transcripts" from ribosome pause sites.

// in workflows/riboseq/main.nf
ch_stringtie_input_bam = rnaseq_bams_available
    ? ch_rnaseq_genome_bam
    : ch_genome_bam  // Ribo-seq fallback

When using the Ribo-seq fallback, emit a pipeline warning and apply tightened ext.args:

Parameter RNA-seq default Ribo-seq fallback
-m (min transcript length) 200 bp 100 bp
-c (min coverage) 1.0 5.0 reads/bp
-j (min junction reads) 1.0 3.0
-f (min isoform fraction) 0.01 0.05
-g (max gap) 50 bp 100 bp

Post-assembly: discard transcripts with TPM < 0.5 (following Wu et al. 2019).

These Ribo-seq fallback parameters are first-pass empirical defaults — not literature-validated for Ribo-seq BAM input. Treat as a starting point pending quality assessment on real data (assembly size, length distribution, class code breakdown).

3. gffcompare + class code filter

After STRINGTIE_MERGE:

# 1. Run gffcompare
gffcompare -r reference.gtf stringtie_merged.gtf -o gffcmp

# 2. Filter to class 'u' (zero overlap with any annotated gene on either strand)
awk '$3=="transcript"||$3=="exon"' gffcmp.annotated.gtf \
    | grep 'class_code "u"' > novel_intergenic.gtf

# 3. Concatenate with canonical backbone (issue #161)
cat canonical_backbone.gtf novel_intergenic.gtf > hybrid_reference.gtf

Add --stringtie_class_codes parameter (default "u"). For fully stranded RNA-seq + stranded Ribo-seq, class x (antisense) can be added via --stringtie_class_codes "u,x" to recover genuinely translated antisense transcripts.

Post-filter: optionally intersect with --rrna_blacklist BED to remove rRNA/repeat artefacts. Make blacklist step conditional (skip silently if absent with a warning) — most non-human organisms lack a published equivalent.

References

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions