feat: ORF-level P-site quantification — replace gene-level counting with per-ORF counts #166

@pinin4fjords

Summary

Replace gene-level P-site aggregation (QUANTIFY_INFRAME_PSITE_PLASTID) with per-ORF counting. Each ORF (main CDS, uORF, novel smORF) gets its own count row, enabling ORF-level differential translation efficiency (DTE) analysis and detection of uORF-mediated regulation.

Blocked by: #161 (canonical backbone), #165 (hybrid GTF wired into callers). Part of --extended_orf_analysis.

Current behaviour

bin/gtf_to_inframe_psites.awk is called with FEATURE=gene (see workflows/riboseq/main.nf, around line 547), which aggregates all in-frame P-sites to the gene level. Per-ORF counting requires FEATURE=transcript (or ORF-ID-based handling) so that each ORF gets its own coordinate set.

Implementation options

Option A: Extend the existing Plastid wiggle approach (recommended). Keep PLASTID_MAKE_WIGGLE. Replace gtf_to_inframe_psites.awk with a script that accepts the merged ORF catalogue (BED12 or GTF) and emits codon-start positions, using the ORF's own ATG as frame 0 (independent of the GTF phase field). Reuse the bedtools intersect + groupby pattern. This is the only option that gives true codon-start counting.
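The codon-start emission described in Option A can be sketched as follows. This is a minimal illustration, not the replacement script itself: the function name `codon_start_positions` and the input shape (a list of 0-based, half-open exon blocks covering only the ORF, as one might derive from a BED12 record) are assumptions made for the example.

```python
from typing import List, Tuple


def codon_start_positions(blocks: List[Tuple[int, int]], strand: str) -> List[int]:
    """Return genomic positions of the first nucleotide of each complete codon.

    `blocks` are hypothetical 0-based, half-open (start, end) intervals that
    cover the ORF only (start codon through stop codon), e.g. BED12 blocks.
    Frame 0 is defined by the ORF's own first nucleotide, so no GTF phase
    field is consulted.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")

    # Flatten the spliced ORF into genomic positions in ascending order.
    positions = [p for start, end in sorted(blocks) for p in range(start, end)]

    # On the minus strand the ORF is read from the highest coordinate down.
    if strand == "-":
        positions.reverse()

    # Ignore a trailing partial codon, then take every third position.
    usable = len(positions) - len(positions) % 3
    return positions[:usable:3]


# A 9-nt ORF split across two exons: the second codon spans the intron.
plus_starts = codon_start_positions([(100, 104), (200, 205)], "+")   # [100, 103, 202]
minus_starts = codon_start_positions([(100, 104), (200, 205)], "-")  # [204, 201, 102]
```

Each emitted position could then be written as a 1-nt BED interval tagged with the ORF ID, which is the shape the existing bedtools intersect + groupby step would consume.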

Option B: Plastid counts_in_region. Run counts_in_region --fiveprime_variable --offset p_offsets.txt --annotation_files merged_orfs.bed. The Plastid source confirms that it calls numpy.nansum(ivc.get_masked_counts(ga)), which sums P-sites at all positions in the span rather than at codon starts only. This is less precise than Option A but avoids custom post-processing.
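The difference between a span sum (Option B) and a codon-start sum (Option A) can be seen on a toy vector. The counts below are invented for illustration, and the NaN simply stands in for a position without a usable count; it is not a claim about how Plastid represents masked positions internally.

```python
import numpy as np

# Hypothetical per-nucleotide P-site counts over a 9-nt ORF (3 codons).
psite_counts = np.array([5, 1, 0, 4, np.nan, 2, 3, 0, 1], dtype=float)

# Option B style: sum every position in the span, skipping NaN.
span_sum = np.nansum(psite_counts)

# Option A style: sum only frame-0 positions (codon starts).
codon_start_sum = np.nansum(psite_counts[0::3])

print(span_sum, codon_start_sum)  # 16.0 12.0
```

The gap between the two numbers is the out-of-frame signal that Option B folds into the ORF count.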

Option C: DOTSeq flattening + featureCounts. orf_to_gtf.py creates a non-overlapping flattened GTF, and featureCounts counts full RPF reads per ORF. This handles same-frame overlapping ORFs correctly but requires DOTSeq (currently Bioconductor dev only).

ORF ID scheme

Stable identifiers encoding enough information for reproducibility:

  • Canonical CDS: ENST00000123456.5_CDS_100_500
  • Non-canonical: chr1_+_10000_10150_frame0

Row names in the count matrix must be consistent between Ribo-seq (numerator) and RNA-seq (denominator) matrices for DTE.
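A small sketch of how the two ID shapes above could be generated and how row-name consistency between the two matrices could be checked. The helper names are hypothetical, and the functions only format whatever coordinates they are given; the coordinate convention (0- or 1-based, transcript or genomic) is not specified here and would need to be fixed by the implementation.

```python
from typing import Iterable, List, Tuple


def canonical_orf_id(transcript_id: str, cds_start: int, cds_end: int) -> str:
    """Format an ID like ENST00000123456.5_CDS_100_500."""
    return f"{transcript_id}_CDS_{cds_start}_{cds_end}"


def noncanonical_orf_id(chrom: str, strand: str, start: int, end: int, frame: int) -> str:
    """Format an ID like chr1_+_10000_10150_frame0."""
    return f"{chrom}_{strand}_{start}_{end}_frame{frame}"


def row_name_mismatches(
    ribo_ids: Iterable[str], rna_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (IDs only in the Ribo-seq matrix, IDs only in the RNA-seq matrix)."""
    ribo, rna = set(ribo_ids), set(rna_ids)
    return sorted(ribo - rna), sorted(rna - ribo)


ribo_rows = [
    canonical_orf_id("ENST00000123456.5", 100, 500),
    noncanonical_orf_id("chr1", "+", 10000, 10150, 0),
]
rna_rows = [canonical_orf_id("ENST00000123456.5", 100, 500)]

only_ribo, only_rna = row_name_mismatches(ribo_rows, rna_rows)
# only_ribo == ["chr1_+_10000_10150_frame0"], only_rna == []
```

Any ID reported in either list would have to be resolved, or the ORF dropped from both matrices, before the DTE step.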

RNA-seq denominator assignment (for DTE)

| ORF class | Denominator |
| --- | --- |
| Canonical CDS ORF | Gene-level RNA-seq count |
| uORF / dORF on annotated transcript | Transcript-level Salmon count |
| ORF on StringTie novel transcript | Transcript-level count (Salmon against hybrid GTF) |
| Novel intergenic ORF, no host transcript | None (counts only, excluded from DTE) |
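The assignment above could be encoded as a simple lookup. The class names and denominator labels in this sketch are hypothetical placeholders, not identifiers used by the pipeline.

```python
from enum import Enum
from typing import Optional


class OrfClass(Enum):
    # Hypothetical class labels mirroring the ORF classes listed above.
    CANONICAL_CDS = "canonical_cds"
    UORF_DORF_ANNOTATED = "uorf_dorf_annotated"
    NOVEL_TRANSCRIPT = "novel_transcript"
    NOVEL_INTERGENIC = "novel_intergenic"


# Which RNA-seq quantity serves as the DTE denominator for each ORF class.
DENOMINATOR = {
    OrfClass.CANONICAL_CDS: "gene_level_rnaseq",
    OrfClass.UORF_DORF_ANNOTATED: "transcript_level_salmon",
    OrfClass.NOVEL_TRANSCRIPT: "transcript_level_salmon_hybrid_gtf",
    OrfClass.NOVEL_INTERGENIC: None,  # counts only, excluded from DTE
}


def denominator_for(orf_class: OrfClass) -> Optional[str]:
    """Return the denominator source, or None if the ORF is excluded from DTE."""
    return DENOMINATOR[orf_class]


def included_in_dte(orf_class: OrfClass) -> bool:
    return denominator_for(orf_class) is not None
```

Keeping the exclusion explicit (a `None` entry rather than a missing key) means an unclassified ORF raises a `KeyError` instead of silently dropping out of the analysis.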

Same-frame overlap caveat

Two ORFs in the same frame that overlap (e.g. N-terminally extended CDS variants) will both claim the same in-frame P-site under Options A and B. This is rare in canonical annotations but real for some ORF classes. Add explicit test cases for same-frame overlap to document behaviour. Option C (DOTSeq flattening) handles this automatically.
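A possible shape for such a test case, shown here as a sketch on single-exon plus-strand ORFs with invented coordinates. It demonstrates that an N-terminally extended ORF and its shorter variant share codon-start positions, so a P-site at a shared position is counted for both, while an out-of-frame overlap shares none.

```python
def codon_starts(start: int, end: int) -> set:
    """Codon-start positions of a single-exon, plus-strand ORF (0-based, half-open)."""
    usable_end = end - (end - start) % 3  # drop any trailing partial codon
    return set(range(start, usable_end, 3))


main_cds = codon_starts(100, 130)       # 10 codons
extended_cds = codon_starts(70, 130)    # same frame, N-terminal extension
shifted_orf = codon_starts(101, 131)    # overlaps main_cds, different frame

shared_same_frame = main_cds & extended_cds
shared_other_frame = main_cds & shifted_orf

# A hypothetical in-frame P-site count at position 103.
psites = {103: 7}
main_count = sum(n for pos, n in psites.items() if pos in main_cds)
extended_count = sum(n for pos, n in psites.items() if pos in extended_cds)

print(len(shared_same_frame), len(shared_other_frame))  # 10 0
print(main_count, extended_count)                       # 7 7
```

Summing counts across both rows would therefore double-count those 7 P-sites, which is the behaviour the explicit test cases should pin down.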

anota2seq compatibility note

P-site counts are sparser and more zero-inflated than Salmon TPM-derived counts. The default filtering thresholds for anota2seq's APV regression were calibrated on RNA-seq-like distributions, so validate them empirically and adjust the minimum-count thresholds for P-site input before releasing (tracked in issue #167).

Cross-tool consensus: fuzzy overlap required

Benchmark finding: under strict coordinate equality on (chrom, start, end, strand), cross-tool Jaccard similarity was only 0.000–0.057, owing to alternative-start ambiguity. ORF catalogue merging (issue #167) must therefore use fuzzy overlap (≥80% reciprocal, bedtools-style), not coordinate equality.
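A sketch of what an 80% reciprocal overlap test looks like, compared with coordinate equality. The function name and tuple layout are assumptions for the example, and it compares genomic spans only (introns are ignored), so a production version would need block-aware overlap for spliced ORFs.

```python
from typing import Tuple

Orf = Tuple[str, int, int, str]  # (chrom, start, end, strand), 0-based half-open


def reciprocal_overlap(a: Orf, b: Orf, min_frac: float = 0.8) -> bool:
    """True if the overlap covers at least `min_frac` of BOTH intervals."""
    chrom_a, start_a, end_a, strand_a = a
    chrom_b, start_b, end_b, strand_b = b
    if chrom_a != chrom_b or strand_a != strand_b:
        return False
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return False
    return (
        overlap / (end_a - start_a) >= min_frac
        and overlap / (end_b - start_b) >= min_frac
    )


tool1_call = ("chr1", 1000, 1300, "+")
tool2_call = ("chr1", 1030, 1300, "+")   # same stop, alternative start
short_call = ("chr1", 1200, 1300, "+")   # covers only a third of tool1_call

exact_match = tool1_call == tool2_call                     # False
fuzzy_match = reciprocal_overlap(tool1_call, tool2_call)   # True (0.9 and 1.0)
fuzzy_short = reciprocal_overlap(tool1_call, short_call)   # False (0.33 of tool1_call)
```

The first pair is the alternative-start case: it fails coordinate equality but passes the reciprocal test, while the much shorter call is still rejected.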

Outputs

References

Labels

enhancement