Summary
Three refinements to PR #158 (StringTie novel transcript discovery):
- Accept user-supplied
--novel_gtf to bypass assembly entirely
- Use RNA-seq BAMs for StringTie when available (Ribo-seq BAMs are a fallback)
- Filter output to gffcompare class
u (intergenic-only) before passing downstream
Blocked by: PR #158 merged. Part of --extended_orf_analysis opt-in flag.
1. User-supplied GTF (--novel_gtf)
Users with a pre-built cell-type-specific annotation (e.g. from long-read RNA-seq or a published tissue-specific catalogue) can supply it directly via --novel_gtf, skipping StringTie assembly. The supplied GTF is passed through the same gffcompare class u filter and used identically downstream. StringTie only runs when --novel_gtf is absent.
2. Prefer RNA-seq BAMs for StringTie
Every published microprotein workflow assembles from RNA-seq, not Ribo-seq: RIBOSS (Lim et al. 2025), Rp3 (Vieira de Souza et al. 2024), Wu et al. tomato (2019), ggRibo Arabidopsis protocol (PMID: 41241901). Ribo-seq footprints produce shallow, sparse overlap graphs in StringTie with non-uniform P-site-biased coverage, leading to fragmented assemblies and false-positive single-exon "transcripts" from ribosome pause sites.
// in workflows/riboseq/main.nf
ch_stringtie_input_bam = rnaseq_bams_available
? ch_rnaseq_genome_bam
: ch_genome_bam // Ribo-seq fallback
When using the Ribo-seq fallback, emit a pipeline warning and apply tightened ext.args:
| Parameter |
RNA-seq default |
Ribo-seq fallback |
-m (min transcript length) |
200 bp |
100 bp |
-c (min coverage) |
1.0 |
5.0 reads/bp |
-j (min junction reads) |
1.0 |
3.0 |
-f (min isoform fraction) |
0.01 |
0.05 |
-g (max gap) |
50 bp |
100 bp |
Post-assembly: discard transcripts with TPM < 0.5 (following Wu et al. 2019).
These Ribo-seq fallback parameters are first-pass empirical defaults — not literature-validated for Ribo-seq BAM input. Treat as a starting point pending quality assessment on real data (assembly size, length distribution, class code breakdown).
3. gffcompare + class code filter
After STRINGTIE_MERGE:
# 1. Run gffcompare
gffcompare -r reference.gtf stringtie_merged.gtf -o gffcmp
# 2. Filter to class 'u' (zero overlap with any annotated gene on either strand)
awk '$3=="transcript"||$3=="exon"' gffcmp.annotated.gtf \
| grep 'class_code "u"' > novel_intergenic.gtf
# 3. Concatenate with canonical backbone (issue #161)
cat canonical_backbone.gtf novel_intergenic.gtf > hybrid_reference.gtf
Add --stringtie_class_codes parameter (default "u"). For fully stranded RNA-seq + stranded Ribo-seq, class x (antisense) can be added via --stringtie_class_codes "u,x" to recover genuinely translated antisense transcripts.
Post-filter: optionally intersect with --rrna_blacklist BED to remove rRNA/repeat artefacts. Make blacklist step conditional (skip silently if absent with a warning) — most non-human organisms lack a published equivalent.
References
Summary
Three refinements to PR #158 (StringTie novel transcript discovery):
--novel_gtfto bypass assembly entirelyu(intergenic-only) before passing downstreamBlocked by: PR #158 merged. Part of
--extended_orf_analysisopt-in flag.1. User-supplied GTF (
--novel_gtf)Users with a pre-built cell-type-specific annotation (e.g. from long-read RNA-seq or a published tissue-specific catalogue) can supply it directly via
--novel_gtf, skipping StringTie assembly. The supplied GTF is passed through the same gffcompare classufilter and used identically downstream. StringTie only runs when--novel_gtfis absent.2. Prefer RNA-seq BAMs for StringTie
Every published microprotein workflow assembles from RNA-seq, not Ribo-seq: RIBOSS (Lim et al. 2025), Rp3 (Vieira de Souza et al. 2024), Wu et al. tomato (2019), ggRibo Arabidopsis protocol (PMID: 41241901). Ribo-seq footprints produce shallow, sparse overlap graphs in StringTie with non-uniform P-site-biased coverage, leading to fragmented assemblies and false-positive single-exon "transcripts" from ribosome pause sites.
When using the Ribo-seq fallback, emit a pipeline warning and apply tightened
ext.args:-m(min transcript length)-c(min coverage)-j(min junction reads)-f(min isoform fraction)-g(max gap)Post-assembly: discard transcripts with TPM < 0.5 (following Wu et al. 2019).
These Ribo-seq fallback parameters are first-pass empirical defaults — not literature-validated for Ribo-seq BAM input. Treat as a starting point pending quality assessment on real data (assembly size, length distribution, class code breakdown).
3. gffcompare + class code filter
After
STRINGTIE_MERGE:Add
--stringtie_class_codesparameter (default"u"). For fully stranded RNA-seq + stranded Ribo-seq, classx(antisense) can be added via--stringtie_class_codes "u,x"to recover genuinely translated antisense transcripts.Post-filter: optionally intersect with
--rrna_blacklistBED to remove rRNA/repeat artefacts. Make blacklist step conditional (skip silently if absent with a warning) — most non-human organisms lack a published equivalent.References