feat: ORF-level P-site quantification — replace gene-level counting with per-ORF counts #166

## Summary

Replace gene-level P-site aggregation (`QUANTIFY_INFRAME_PSITE_PLASTID`) with per-ORF counting. Each ORF (main CDS, uORF, novel smORF) gets its own count row, enabling ORF-level DTE and uORF-mediated regulation detection.

**Blocked by:** #161 (canonical backbone), #165 (hybrid GTF wired into callers). Part of `--extended_orf_analysis`.

## Current behaviour

`bin/gtf_to_inframe_psites.awk` is called with `FEATURE=gene` (see `workflows/riboseq/main.nf` ~line 547), which aggregates all in-frame P-sites to the gene level. The per-ORF version requires `FEATURE=transcript` (or ORF-ID-based handling) so each ORF gets its own coordinate set.

## Implementation options

**Option A — Extend existing Plastid wiggle approach (recommended):** Keep `PLASTID_MAKE_WIGGLE`. Replace `gtf_to_inframe_psites.awk` with a script that accepts the merged ORF catalogue (BED12 or GTF) and emits codon-start positions using the ORF's own ATG as frame 0 (independent of GTF `phase` field). Reuse the `bedtools intersect + groupby` pattern. **Only option giving true codon-start counting.**
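To make "the ORF's own ATG as frame 0" concrete, here is a minimal sketch of how codon-start positions could be derived from an ORF's exon blocks. It is not the replacement script itself: the function name, the 0-based half-open block representation, and the requirement that the ORF length be a multiple of 3 are assumptions made for illustration.

```python
from typing import List, Tuple


def codon_start_positions(blocks: List[Tuple[int, int]], strand: str) -> List[int]:
    """Return genomic positions (0-based) of the first nucleotide of each codon.

    blocks: exon blocks of one ORF as 0-based half-open (start, end) pairs,
            sorted by ascending genomic start (BED12-style, assumed).
    Frame 0 is anchored at the ORF's own start codon, so the GTF `phase`
    field is never consulted.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")

    # Flatten the blocks into one list of genomic positions.
    positions = [pos for start, end in blocks for pos in range(start, end)]
    if len(positions) % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")

    # On the minus strand the 5' end is the highest coordinate.
    if strand == "-":
        positions.reverse()

    # Every third position, starting at the first nucleotide of the start codon.
    return positions[::3]


# A 9-nt ORF split across two exons.
print(codon_start_positions([(100, 104), (200, 205)], "+"))  # [100, 103, 202]
print(codon_start_positions([(100, 104), (200, 205)], "-"))  # [204, 201, 102]
```

Walking the flattened positions rather than each exon separately keeps the frame correct across splice junctions, which is the case where relying on per-exon `phase` values tends to go wrong.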

**Option B — Plastid `counts_in_region`:** `counts_in_region --fiveprime_variable --offset p_offsets.txt --annotation_files merged_orfs.bed`. Source confirmed: calls `numpy.nansum(ivc.get_masked_counts(ga))` — sums **all** P-site positions in the span, not codon-start only. Less precise than A but avoids custom post-processing.
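The difference between span-wide summation and codon-start counting can be shown with a toy array. The counts below are invented for illustration, and the NaN is included only to show that `numpy.nansum` skips it; nothing here calls Plastid.

```python
import numpy as np

# Hypothetical per-nucleotide P-site counts over a 9-nt ORF (3 codons),
# ordered 5'->3' with index 0 at the first nucleotide of the start codon.
psite_counts = np.array([5, 1, 0, 4, 0, 2, 3, np.nan, 1], dtype=float)

# Option B style: sum every position in the span, whatever its frame.
all_positions = np.nansum(psite_counts)

# Option A style: keep only frame-0 positions (indices 0, 3, 6).
codon_start_only = np.nansum(psite_counts[::3])

print(all_positions)     # 16.0
print(codon_start_only)  # 12.0
```

In this toy case the four extra counts come from frames 1 and 2, which is the signal Option A deliberately excludes.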

**Option C — DOTSeq flattening + featureCounts:** `orf_to_gtf.py` creates a non-overlapping flattened GTF; featureCounts counts full RPF reads per ORF. Handles same-frame overlapping ORFs correctly; requires DOTSeq (currently Bioconductor dev only).

## ORF ID scheme

Stable identifiers encoding enough information for reproducibility:
- Canonical CDS: `ENST00000123456.5_CDS_100_500`
- Non-canonical: `chr1_+_10000_10150_frame0`

Row names in the count matrix must be consistent between Ribo-seq (numerator) and RNA-seq (denominator) matrices for DTE.
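A minimal sketch of how IDs in the two formats above could be built, plus a check that the Ribo-seq and RNA-seq matrices share row names. The function names are hypothetical, and the issue does not say which coordinate system the numbers refer to, so the arguments are passed through unchanged.

```python
from typing import Iterable, List, Tuple


def canonical_cds_id(transcript_id: str, cds_start: int, cds_end: int) -> str:
    """ID for a canonical CDS, e.g. ENST00000123456.5_CDS_100_500."""
    return f"{transcript_id}_CDS_{cds_start}_{cds_end}"


def noncanonical_orf_id(chrom: str, strand: str, start: int, end: int, frame: int) -> str:
    """ID for a non-canonical ORF, e.g. chr1_+_10000_10150_frame0."""
    return f"{chrom}_{strand}_{start}_{end}_frame{frame}"


def row_name_mismatches(
    ribo_ids: Iterable[str], rna_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (IDs only in the Ribo-seq matrix, IDs only in the RNA-seq matrix)."""
    ribo, rna = set(ribo_ids), set(rna_ids)
    return sorted(ribo - rna), sorted(rna - ribo)


ribo_rows = [
    canonical_cds_id("ENST00000123456.5", 100, 500),
    noncanonical_orf_id("chr1", "+", 10000, 10150, 0),
]
rna_rows = [canonical_cds_id("ENST00000123456.5", 100, 500)]

only_ribo, only_rna = row_name_mismatches(ribo_rows, rna_rows)
print(only_ribo)  # ['chr1_+_10000_10150_frame0']
print(only_rna)   # []
```

Failing the pipeline early when either list is non-empty, apart from ORFs intentionally excluded from DTE, would catch ID drift before it reaches the statistical step.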

## RNA-seq denominator assignment (for DTE)

| ORF class | Denominator |
|---|---|
| Canonical CDS ORF | Gene-level RNA-seq count |
| uORF / dORF on annotated transcript | Transcript-level Salmon count |
| ORF on StringTie novel transcript | Transcript-level count (Salmon against hybrid GTF) |
| Novel intergenic ORF, no host transcript | None — counts only, excluded from DTE |
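The table can be encoded as a small lookup so that denominator selection is explicit and testable. The class labels used as keys are hypothetical placeholders; the issue does not define machine-readable ORF class names.

```python
from typing import Dict, Optional

# Hypothetical class labels mapped to the RNA-seq quantification level in the
# table above; None means "counts only, excluded from DTE".
DENOMINATOR_BY_CLASS: Dict[str, Optional[str]] = {
    "canonical_cds": "gene",
    "uorf_dorf_annotated": "transcript",
    "novel_transcript_orf": "transcript",
    "novel_intergenic": None,
}


def denominator_level(orf_class: str) -> Optional[str]:
    """Return 'gene', 'transcript', or None for the given ORF class."""
    if orf_class not in DENOMINATOR_BY_CLASS:
        raise KeyError(f"unknown ORF class: {orf_class!r}")
    return DENOMINATOR_BY_CLASS[orf_class]


def included_in_dte(orf_class: str) -> bool:
    """An ORF enters DTE only if it has an RNA-seq denominator."""
    return denominator_level(orf_class) is not None


print(denominator_level("canonical_cds"))   # gene
print(included_in_dte("novel_intergenic"))  # False
```

Raising on unknown classes, rather than silently defaulting to gene level, avoids pairing a new ORF class with the wrong denominator.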

## Same-frame overlap caveat

Two ORFs in the **same frame** that overlap (e.g. N-terminally extended CDS variants) will both claim the same in-frame P-site under Options A and B. This is rare in canonical annotations but real for some ORF classes. Add explicit test cases for same-frame overlap to document behaviour. Option C (DOTSeq flattening) handles this automatically.
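A test case for this caveat might look like the sketch below. It uses single-exon, plus-strand ORFs with invented coordinates, and the helper is a simplified stand-in for whatever codon-start script Option A ends up using.

```python
def codon_starts(start: int, end: int) -> set:
    """Codon-start positions of a single-exon plus-strand ORF (0-based, half-open)."""
    return set(range(start, end, 3))


# N-terminally extended variant and the annotated CDS: the starts are 12 nt
# apart, so both ORFs are in the same frame and share a stop.
extended = codon_starts(100, 130)
canonical = codon_starts(112, 130)

# An ORF shifted by 1 nt is out of frame with respect to both.
out_of_frame = codon_starts(101, 131)

shared = extended & canonical
psite = 115  # an in-frame P-site inside the overlap

print(sorted(shared))                          # [112, 115, 118, 121, 124, 127]
print(psite in extended, psite in canonical)   # True True
print(extended & out_of_frame)                 # set()
```

The second line of output is the behaviour to document: under Options A and B the same P-site contributes to both rows, so the per-ORF counts do not sum to the gene-level count.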

## anota2seq compatibility note

P-site counts are sparser and more zero-inflated than Salmon TPM-derived counts. The default filtering thresholds of anota2seq's APV regression were calibrated on RNA-seq-like distributions, so validate them empirically and adjust the minimum-count thresholds for P-site input before release (tracked in issue #167).

## Cross-tool consensus: fuzzy overlap required

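Benchmark finding: when ORFs are matched across tools by strict (chrom, start, end, strand) coordinate equality, the cross-tool Jaccard index is only 0.000–0.057, because the tools frequently disagree on the start codon (alt-start ambiguity). ORF catalogue merging (issue #167) must therefore use fuzzy overlap (≥80% reciprocal overlap, bedtools-style), not coordinate equality.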
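The contrast between the two matching rules can be sketched as follows. This is an illustration in the spirit of `bedtools intersect -f 0.8 -r`, not the merging code: it compares genomic spans only, ignores introns, and uses invented coordinates and function names.

```python
from typing import Set, Tuple

Orf = Tuple[str, int, int, str]  # (chrom, start, end, strand), 0-based half-open


def reciprocal_overlap(a: Orf, b: Orf, min_frac: float = 0.8) -> bool:
    """True if a and b overlap by at least min_frac of BOTH spans."""
    if a[0] != b[0] or a[3] != b[3]:
        return False  # different chromosome or strand
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return overlap >= min_frac * (a[2] - a[1]) and overlap >= min_frac * (b[2] - b[1])


def strict_jaccard(set_a: Set[Orf], set_b: Set[Orf]) -> float:
    """Jaccard index under exact coordinate equality."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Two tools call the same ORF with start codons 12 nt apart.
tool1 = ("chr1", 10000, 10150, "+")
tool2 = ("chr1", 10012, 10150, "+")
short = ("chr1", 10100, 10150, "+")

print(strict_jaccard({tool1}, {tool2}))   # 0.0
print(reciprocal_overlap(tool1, tool2))   # True  (138/150 and 138/138)
print(reciprocal_overlap(tool1, short))   # False (50/150 is below 0.8)
```

Requiring the threshold on both spans prevents a short nested ORF from being merged into a much longer one that merely contains it.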

## Outputs

- `orf_psite_counts.tsv`: ORF × sample raw count matrix
- `orf_catalogue.gtf` / `.bed12`: merged ORF catalogue used for counting
- `orf_to_gene.tsv`: ORF ID → host gene ID mapping (for gene-level pre-aggregation, issue #167)

## References

- ORFquant (23% dominant ORF on non-principal isoform): [Calviello et al. Nat Struct Mol Biol 2020](https://doi.org/10.1038/s41594-020-0450-4)
- DOTSeq: [Lim & Chieng, bioRxiv 2025](https://doi.org/10.1101/2025.09.24.678201)
- Plastid `counts_in_region`: [gene expression tutorial](https://plastid.readthedocs.io/en/latest/examples/gene_expression.html)
- Current code: `modules/local/quantify_inframe_psite_plastid/`, `bin/gtf_to_inframe_psites.awk`
