Summary
Replace gene-level P-site aggregation (QUANTIFY_INFRAME_PSITE_PLASTID) with per-ORF counting. Each ORF (main CDS, uORF, novel smORF) gets its own count row, enabling ORF-level differential translation efficiency (DTE) analysis and detection of uORF-mediated regulation.
Blocked by: #161 (canonical backbone), #165 (hybrid GTF wired into callers). Part of --extended_orf_analysis.
Current behaviour
bin/gtf_to_inframe_psites.awk is called with FEATURE=gene (see workflows/riboseq/main.nf ~line 547), which aggregates all in-frame P-sites to the gene level. The per-ORF version requires FEATURE=transcript (or ORF-ID-based handling) so each ORF gets its own coordinate set.
Implementation options
Option A — Extend existing Plastid wiggle approach (recommended): Keep PLASTID_MAKE_WIGGLE. Replace gtf_to_inframe_psites.awk with a script that accepts the merged ORF catalogue (BED12 or GTF) and emits codon-start positions using the ORF's own ATG as frame 0 (independent of GTF phase field). Reuse the bedtools intersect + groupby pattern. Only option giving true codon-start counting.
Option B — Plastid counts_in_region: counts_in_region --fiveprime_variable --offset p_offsets.txt --annotation_files merged_orfs.bed. Source confirmed: calls numpy.nansum(ivc.get_masked_counts(ga)) — sums all P-site positions in the span, not codon-start only. Less precise than A but avoids custom post-processing.
Option C — DOTSeq flattening + featureCounts: orf_to_gtf.py creates a non-overlapping flattened GTF; featureCounts counts full RPF reads per ORF. Handles same-frame overlapping ORFs correctly; requires DOTSeq (currently Bioconductor dev only).
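The practical difference between Options A and B is which positions contribute to a count. Below is a minimal Python sketch of that difference, not the pipeline's script. It assumes the ORF is given as 0-based half-open exon blocks covering the start codon through the stop codon, and that P-site counts are a plain position-to-count mapping for one chromosome and strand. The function names `codon_start_positions` and `count_at` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def codon_start_positions(exons: List[Tuple[int, int]], strand: str) -> List[int]:
    """Return genomic positions of the first nucleotide of every codon.

    exons: 0-based half-open (start, end) blocks covering the ORF only.
    Frame 0 is anchored at the ORF's own start codon, so the GTF phase
    column is never consulted.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    # Flatten the spliced ORF into a list of genomic positions, 5' to 3'.
    positions: List[int] = []
    for start, end in sorted(exons):
        positions.extend(range(start, end))
    if strand == "-":
        positions.reverse()
    if len(positions) % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    return positions[::3]


def count_at(psites: Dict[int, int], positions: Iterable[int]) -> int:
    """Sum P-site counts over the given positions (assumed data layout)."""
    return sum(psites.get(p, 0) for p in positions)


# A 9-nt ORF split across two exons; the second codon spans the intron.
exons = [(100, 104), (200, 205)]
plus_starts = codon_start_positions(exons, "+")    # [100, 103, 202]
minus_starts = codon_start_positions(exons, "-")   # [204, 201, 102]

# Toy P-site counts: 5 in frame at 100, 2 out of frame at 101, 3 in frame at 202.
psites = {100: 5, 101: 2, 202: 3}
option_a_like = count_at(psites, plus_starts)                     # codon starts only
option_b_like = count_at(psites, (p for s, e in exons for p in range(s, e)))  # whole span
```

In this toy case the codon-start count is 8 and the whole-span sum is 10; the out-of-frame P-site at position 101 accounts for the difference.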
ORF ID scheme
Stable identifiers encoding enough information for reproducibility:
- Canonical CDS: ENST00000123456.5_CDS_100_500
- Non-canonical: chr1_+_10000_10150_frame0
Row names in the count matrix must be consistent between Ribo-seq (numerator) and RNA-seq (denominator) matrices for DTE.
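To keep row names consistent, both matrices could build and parse IDs through one shared helper. The sketch below is illustrative only: the function names are hypothetical, and the meaning of the two numbers in the canonical ID (transcript or genomic coordinates) is not specified above, so they are carried through as opaque integers. Chromosome names can contain underscores, which the greedy leading group handles.

```python
import re
from typing import Dict, Union

# Patterns mirror the two example IDs; the field names are assumptions.
_CANONICAL = re.compile(r"^(?P<transcript>.+)_CDS_(?P<start>\d+)_(?P<end>\d+)$")
_NONCANONICAL = re.compile(
    r"^(?P<chrom>.+)_(?P<strand>[+-])_(?P<start>\d+)_(?P<end>\d+)_frame(?P<frame>[012])$"
)


def canonical_orf_id(transcript_id: str, start: int, end: int) -> str:
    return f"{transcript_id}_CDS_{start}_{end}"


def noncanonical_orf_id(chrom: str, strand: str, start: int, end: int, frame: int) -> str:
    return f"{chrom}_{strand}_{start}_{end}_frame{frame}"


def parse_orf_id(orf_id: str) -> Dict[str, Union[str, int]]:
    """Split an ORF ID back into its fields; raise if it matches neither form."""
    m = _NONCANONICAL.match(orf_id)
    if m:
        return {
            "kind": "noncanonical",
            "chrom": m["chrom"],
            "strand": m["strand"],
            "start": int(m["start"]),
            "end": int(m["end"]),
            "frame": int(m["frame"]),
        }
    m = _CANONICAL.match(orf_id)
    if m:
        return {
            "kind": "canonical",
            "transcript": m["transcript"],
            "start": int(m["start"]),
            "end": int(m["end"]),
        }
    raise ValueError(f"unrecognised ORF ID: {orf_id!r}")


# Round trip on a scaffold name that itself contains underscores.
scaffold_id = noncanonical_orf_id("chr1_KI270706v1_random", "-", 100, 250, 2)
parsed = parse_orf_id(scaffold_id)
```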
RNA-seq denominator assignment (for DTE)
| ORF class | Denominator |
| --- | --- |
| Canonical CDS ORF | Gene-level RNA-seq count |
| uORF / dORF on annotated transcript | Transcript-level Salmon count |
| ORF on StringTie novel transcript | Transcript-level count (Salmon against hybrid GTF) |
| Novel intergenic ORF, no host transcript | None (counts only; excluded from DTE) |
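The assignment above amounts to a lookup from ORF class to denominator source, with intergenic ORFs dropped before DTE. A minimal sketch follows; the enum members, source labels, and `dte_rows` helper are hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Dict, List, Optional


class OrfClass(Enum):
    CANONICAL_CDS = "canonical_cds"
    UORF_DORF_ANNOTATED = "uorf_dorf_annotated"
    NOVEL_TRANSCRIPT = "novel_transcript"
    INTERGENIC_NO_HOST = "intergenic_no_host"


# Denominator source per ORF class, following the table above.
# None means the ORF is reported as counts only and excluded from DTE.
DENOMINATOR: Dict[OrfClass, Optional[str]] = {
    OrfClass.CANONICAL_CDS: "gene_level_rnaseq",
    OrfClass.UORF_DORF_ANNOTATED: "transcript_level_salmon",
    OrfClass.NOVEL_TRANSCRIPT: "transcript_level_salmon_hybrid_gtf",
    OrfClass.INTERGENIC_NO_HOST: None,
}


def dte_rows(orf_classes: Dict[str, OrfClass]) -> List[str]:
    """Return the ORF IDs that have a denominator and can enter DTE."""
    return [orf for orf, cls in orf_classes.items() if DENOMINATOR[cls] is not None]


catalogue = {
    "ENST00000123456.5_CDS_100_500": OrfClass.CANONICAL_CDS,
    "chr1_+_10000_10150_frame0": OrfClass.INTERGENIC_NO_HOST,
}
testable = dte_rows(catalogue)
```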
Same-frame overlap caveat
Two ORFs in the same frame that overlap (e.g. N-terminally extended CDS variants) will both claim the same in-frame P-site under Options A and B. This is rare in canonical annotations but real for some ORF classes. Add explicit test cases for same-frame overlap to document behaviour. Option C (DOTSeq flattening) handles this automatically.
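A test case for this caveat might look like the sketch below. It uses two hypothetical single-exon plus-strand ORFs sharing a stop codon, with toy P-site counts, and it models codon-start counting directly rather than calling the pipeline. The assertions document the double counting rather than prevent it.

```python
from typing import Dict, Set


def codon_starts(start: int, end: int) -> Set[int]:
    """Codon-start positions of a single-exon plus-strand ORF (0-based, half-open)."""
    return set(range(start, end, 3))


def count(psites: Dict[int, int], orf: Set[int]) -> int:
    return sum(n for pos, n in psites.items() if pos in orf)


def test_same_frame_overlap_double_counts() -> None:
    # N-terminally extended variant and the shorter ORF nested inside it.
    extended = codon_starts(100, 130)
    nested = codon_starts(109, 130)   # start shifted by 9 nt, so same frame

    # Every codon start of the nested ORF is also one of the extended ORF.
    assert nested <= extended

    psites = {103: 4, 112: 7, 127: 2}   # toy in-frame P-site counts
    assert count(psites, extended) == 13
    assert count(psites, nested) == 9

    # The per-ORF rows sum to more than the P-sites that exist,
    # because positions 112 and 127 are claimed by both ORFs.
    assert count(psites, extended) + count(psites, nested) > sum(psites.values())
```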
anota2seq compatibility note
P-site counts are sparser and more zero-inflated than Salmon TPM-derived counts. The default filtering thresholds in anota2seq's APV regression were calibrated on RNA-seq-like distributions, so validate them empirically and adjust the minimum-count thresholds for P-site input before releasing (tracked in issue #167).
Cross-tool consensus: fuzzy overlap required
Benchmark finding: matching ORFs across tools by strict (chrom, start, end, strand) coordinate equality gives a cross-tool Jaccard index of only 0.000–0.057, owing to alternative-start ambiguity. ORF catalogue merging (issue #167) must therefore use fuzzy overlap (≥80% reciprocal overlap, bedtools-style), not coordinate equality.
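As an illustration of the matching rule, the sketch below computes reciprocal overlap in the style of `bedtools intersect -f 0.8 -r`. It is a simplification that compares genomic spans only and ignores intron structure; the `Orf` type and function names are hypothetical.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    chrom: str
    start: int   # 0-based, half-open
    end: int
    strand: str


def reciprocal_overlap(a: Orf, b: Orf) -> float:
    """Smaller of the two overlap fractions (overlap / len(a), overlap / len(b))."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def same_orf(a: Orf, b: Orf, min_frac: float = 0.8) -> bool:
    """True when each ORF covers at least min_frac of the other."""
    return reciprocal_overlap(a, b) >= min_frac


# Two calls of one ORF that differ only in the chosen start codon.
tool_a = Orf("chr1", 10000, 10150, "+")
tool_b = Orf("chr1", 10012, 10150, "+")   # start 12 nt downstream
fragment = Orf("chr1", 10090, 10150, "+")

strict_match = tool_a == tool_b           # False: coordinates differ
fuzzy_match = same_orf(tool_a, tool_b)    # True: 138/150 = 0.92 reciprocal overlap
fragment_match = same_orf(tool_a, fragment)  # False: 60/150 = 0.40
```

Strict equality treats the two alternative-start calls as different ORFs, while the 80% reciprocal rule merges them and still rejects the short fragment.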
Outputs
- orf_psite_counts.tsv: ORF × sample raw count matrix
- orf_catalogue.gtf/.bed12: merged ORF catalogue used for counting
- orf_to_gene.tsv: ORF ID → host gene ID mapping (for gene-level pre-aggregation, issue #167)
References
- counts_in_region: gene expression tutorial
- modules/local/quantify_inframe_psite_plastid/, bin/gtf_to_inframe_psites.awk