feat: cross-sample ORF catalogue — merge, deduplicate, and re-quantify across all samples

## Summary

ORF callers run per sample and produce per-sample ORF lists. To build a consistent count matrix, these lists must be merged into a single run-level catalogue before quantification, analogous to merging ATAC-seq peaks into a consensus set before re-quantification.

**Blocked by:** #166 (ORF-level quantification, which defines the ORF catalogue format). Part of `--extended_orf_analysis`.

## Merging strategy

### Canonical ORFs (>100 aa)

**Two-path approach** to handle spliced ORFs:
- **Annotated multi-exon CDS ORFs:** collapse by shared transcript ID across samples. These are already uniquely identified; coordinate merging would incorrectly merge ORFs sharing genomic span but with different intron structures.
- **Single-exon novel intergenic ORFs (class `u`):** coordinate-based strand-aware merge:
  ```bash
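  # -s merges same-strand features only; -c 6 -o distinct reports the strand (as column 4), assuming BED6 input
  cat sample*.bed | sort -k1,1 -k2,2n | bedtools merge -s -c 6 -o distinct > merged_canonical.bed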
  ```

Near-identical ORFs that differ only through caller-specific TIS uncertainty are collapsed using **99% amino acid identity + shared stop codon**, the criterion used by the GENCODE Ribo-seq ORF catalogue (Ruiz-Orera et al. Nat Biotechnol 2022), rather than a nucleotide-window heuristic.
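
A minimal Python sketch of the shared-stop collapse follows. The `Orf` record, its field names, and the way identity is measured (residues compared from the C-terminus over the shorter sequence, so an ORF with a downstream start scores 100% against its longer version) are assumptions made for illustration. They are not taken from the GENCODE pipeline, which may define identity differently.

```python
from collections import defaultdict
from typing import NamedTuple


class Orf(NamedTuple):
    # Hypothetical record layout, for illustration only.
    orf_id: str
    chrom: str
    strand: str
    stop: int      # genomic coordinate of the stop codon
    protein: str   # amino acid sequence without the terminal '*'


def c_terminal_identity(a: str, b: str) -> float:
    """Fraction of matching residues, anchored at the C-terminus, over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(reversed(a), reversed(b)))
    return matches / n


def collapse_shared_stop(orfs, min_identity=0.99):
    """Return {representative_id: [member_ids]} for ORFs that share a stop codon."""
    by_stop = defaultdict(list)
    for orf in orfs:
        by_stop[(orf.chrom, orf.strand, orf.stop)].append(orf)

    collapsed = {}
    for group in by_stop.values():
        representatives = []
        # Longest first, so each cluster is represented by its longest member.
        for orf in sorted(group, key=lambda o: (-len(o.protein), o.orf_id)):
            for rep in representatives:
                if c_terminal_identity(rep.protein, orf.protein) >= min_identity:
                    collapsed[rep.orf_id].append(orf.orf_id)
                    break
            else:
                representatives.append(orf)
                collapsed[orf.orf_id] = [orf.orf_id]
    return collapsed


example_orfs = [
    Orf("s1_orfA", "chr1", "+", 1000, "MKTAYIAKQRMLSPEG"),
    Orf("s2_orfA", "chr1", "+", 1000, "MLSPEG"),    # same ORF, downstream start
    Orf("s1_orfB", "chr1", "+", 1000, "MAAAAAAA"),  # same stop, different peptide
    Orf("s3_orfC", "chr2", "-", 500, "MLSPEG"),     # same peptide, different locus
]
clusters = collapse_shared_stop(example_orfs)
```

In this example `s2_orfA` is folded into `s1_orfA`, while `s1_orfB` and `s3_orfC` stay separate because they fail the identity test or sit at a different stop codon.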

### Small ORFs (≤100 aa — the smORF threshold)

The 100 aa / 300 nt cutoff is the field standard: sORFs.org (NAR 2016), GENCODE Ribo-seq catalogue, SmProt v2. After the coordinate merge, apply sequence-based deduplication, because smORFs in repetitive regions may be called at multiple genomic copies that encode the same peptide:

```bash
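# CD-HIT at 100% identity over the full length of both sequences.
# -aL/-aS 1.0 stop a peptide from being absorbed by a longer one that merely contains it.
# -l 10 is the default and discards very short peptides; lower it if those must be kept.
cd-hit -i merged_smorfs.faa -o deduped.faa -c 1.00 -n 5 -aL 1.0 -aS 1.0 -l 10 -M 4000 -T 8

# Or MMseqs2 (faster for large catalogues; --cluster-mode 2 is greedy incremental clustering, analogous to CD-HIT)
mmseqs easy-cluster merged_smorfs.faa result tmp/ --min-seq-id 1.0 --cov-mode 0 -c 1.0 --cluster-mode 2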
```
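
At exactly 100% identity over the full length, deduplication reduces to grouping records by identical sequence. The Python sketch below shows that idea and keeps the mapping from each peptide back to all genomic copies. It is an illustration, not a replacement for CD-HIT or MMseqs2, and the function names and record IDs are hypothetical.

```python
from collections import defaultdict


def read_fasta(text):
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    record_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            record_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def dedupe_exact(records):
    """Group record IDs by identical peptide sequence (case and trailing '*' ignored)."""
    clusters = defaultdict(list)
    for record_id, seq in records:
        clusters[seq.upper().rstrip("*")].append(record_id)
    return dict(clusters)


example_fasta = """>chr1_orf1
MKLSPEG*
>chr7_orf9 second genomic copy
MKLS
PEG
>chr2_orf3
MKLSPEGA
"""
peptide_to_loci = dedupe_exact(read_fasta(example_fasta))
```

Here `chr1_orf1` and `chr7_orf9` collapse to one peptide, while `chr2_orf3` stays separate because it is one residue longer.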

### Recurrence filter (optional)

Require each ORF to be called in ≥2 samples (or ≥1 per condition) before catalogue inclusion. This reduces single-sample false positives without requiring a hard intersection across all samples.
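
The two variants of the filter could be sketched as below. This assumes ORF IDs have already been harmonised to catalogue IDs, reads "≥1 per condition" as "called at least once in every condition", and uses hypothetical sample and condition names.

```python
from collections import defaultdict


def recurrent_orfs(calls, sample_to_condition, min_samples=2, per_condition=False):
    """Return the set of ORF IDs that pass the recurrence filter.

    calls: iterable of (orf_id, sample) pairs, one per ORF call.
    sample_to_condition: dict mapping each sample name to its condition.
    per_condition: if True, require a call in every condition instead of
        a minimum number of samples.
    """
    samples_by_orf = defaultdict(set)
    for orf_id, sample in calls:
        samples_by_orf[orf_id].add(sample)

    all_conditions = set(sample_to_condition.values())
    kept = set()
    for orf_id, samples in samples_by_orf.items():
        if per_condition:
            seen = {sample_to_condition[s] for s in samples}
            if seen == all_conditions:
                kept.add(orf_id)
        elif len(samples) >= min_samples:
            kept.add(orf_id)
    return kept


example_calls = [
    ("orf1", "ctrl_1"), ("orf1", "ctrl_2"),
    ("orf2", "ctrl_1"),
    ("orf3", "ctrl_1"), ("orf3", "treat_1"),
]
example_conditions = {"ctrl_1": "ctrl", "ctrl_2": "ctrl", "treat_1": "treat"}

by_sample_count = recurrent_orfs(example_calls, example_conditions)
by_condition = recurrent_orfs(example_calls, example_conditions, per_condition=True)
```

With these inputs the sample-count rule keeps `orf1` and `orf3`, and the per-condition rule keeps only `orf3`.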

### Re-quantification

All samples are re-quantified against the merged catalogue (issue #166). A sample in which an ORF was not called is still quantified for that ORF; if no reads are assigned, the count is zero, and that zero is a valid observation, not missing data.
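
As a small illustration of the zero-versus-missing point, the sketch below builds a dense matrix in which every catalogue ORF has an explicit count for every sample. The input layout (one `{orf_id: count}` dict per sample) and the function name are assumptions, not the format defined in issue #166.

```python
def build_count_matrix(catalogue_ids, per_sample_counts):
    """Return {orf_id: {sample: count}} with an explicit 0 where no reads were assigned."""
    samples = sorted(per_sample_counts)
    return {
        orf_id: {s: per_sample_counts[s].get(orf_id, 0) for s in samples}
        for orf_id in catalogue_ids
    }


example_catalogue = ["orf1", "orf2"]
example_counts = {
    "s1": {"orf1": 12, "orf2": 3},
    "s2": {"orf1": 7},  # orf2 received no reads in s2
}
matrix = build_count_matrix(example_catalogue, example_counts)
```

`matrix["orf2"]["s2"]` is `0` rather than absent, so downstream tools see a complete ORF-by-sample table.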

## Cross-tool consensus: fuzzy overlap required

Benchmark finding (May 2026): cross-tool Jaccard under strict coordinate equality was only 0.000–0.057. The cause is alt-start ambiguity: tools assign different start coordinates to the same biological ORF. **Do not use strict (chrom, start, end, strand) equality joins for cross-tool merging.** Use bedtools-style fuzzy overlap (≥80% reciprocal overlap, e.g. `bedtools intersect -f 0.8 -r -s`) or gene-level aggregation.
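
The difference between the two join rules can be shown on a toy example. The sketch below assumes BED-style 0-based half-open coordinates and compares genomic spans only, so it ignores intron structure. The Jaccard variant for fuzzy matching (counting calls in the first set that have any match in the second) is an illustrative assumption, not the metric used in the benchmark.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, half-open, as in BED
    end: int
    strand: str


def strict_equal(a, b):
    """Exact (chrom, start, end, strand) equality."""
    return a == b


def reciprocal_overlap(a, b, min_frac=0.8):
    """True if the overlap covers at least min_frac of both intervals on the same strand."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return False
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a.end - a.start)
            and overlap >= min_frac * (b.end - b.start))


def jaccard(calls_a, calls_b, match):
    """Jaccard index where 'intersection' counts calls in A that match any call in B."""
    if not calls_a and not calls_b:
        return 0.0
    shared = sum(any(match(a, b) for b in calls_b) for a in calls_a)
    return shared / (len(calls_a) + len(calls_b) - shared)


# Two hypothetical tools calling the same two ORFs; one start differs by 30 nt.
tool_a = [Interval("chr1", 100, 400, "+"), Interval("chr2", 50, 200, "-")]
tool_b = [Interval("chr1", 130, 400, "+"), Interval("chr2", 50, 200, "-")]

strict_score = jaccard(tool_a, tool_b, strict_equal)
fuzzy_score = jaccard(tool_a, tool_b, reciprocal_overlap)
```

The strict join recovers one of the two shared ORFs (Jaccard 1/3), while the 80% reciprocal-overlap join recovers both (Jaccard 1.0).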

## Output format

Per-tool outputs need a normalisation step before the catalogue merge:
- RiboCode: `-b` flag → BED12 (cleanest)
- Ribo-TISH: `--blocks` → BED-like block coordinates
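
One possible normalisation target is a tuple of absolute exon blocks per ORF, which also preserves the intron structure that a plain coordinate merge would lose. The sketch below parses a standard BED12 line into that form. It assumes the tool output follows the BED12 column layout, and the function name is hypothetical.

```python
def bed12_to_blocks(line):
    """Parse one BED12 line into (chrom, strand, ((block_start, block_end), ...)).

    Coordinates are absolute, 0-based and half-open. blockSizes and blockStarts
    may carry a trailing comma, which is tolerated here.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected 12 BED columns, got {len(fields)}")
    chrom, chrom_start, strand = fields[0], int(fields[1]), fields[5]
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    if len(sizes) != len(offsets):
        raise ValueError("blockSizes and blockStarts differ in length")
    blocks = tuple(
        (chrom_start + offset, chrom_start + offset + size)
        for offset, size in zip(offsets, sizes)
    )
    return chrom, strand, blocks


example_line = "chr1\t1000\t2000\torfX\t0\t+\t1000\t2000\t0\t2\t100,200,\t0,800,"
orf_key = bed12_to_blocks(example_line)
```

Two ORFs with the same genomic span but different introns yield different block tuples, so the tuple can serve as a structure-aware key before merging.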

Outputs:
- `orf_catalogue.gtf` / `orf_catalogue.bed12`
- `orf_catalogue.faa` (amino acid sequences for sequence deduplication)
- `orf_to_gene.tsv` (ORF ID → host gene ID, for gene-level pre-aggregation in issue #167)
- MultiQC summary: ORF counts per class in final catalogue

## Reference implementation

[gencode-riboseqORFs](https://github.com/jorruior/gencode-riboseqORFs) — the tool used to build the GENCODE Phase I Ribo-seq ORF catalogue — implements coordinate + sequence-based collapse and can serve as a reference or be wrapped as a local module.

## References

- GENCODE Ribo-seq ORF catalogue: [Ruiz-Orera et al. Nat Biotechnol 2022](https://pubmed.ncbi.nlm.nih.gov/36456725/)
- sORFs.org: [Olexiouk et al. NAR 2016](https://pubmed.ncbi.nlm.nih.gov/26527729/)
- MMseqs2: [Steinegger & Söding, Nat Commun 2018](https://doi.org/10.1038/s41467-018-04964-5)
- ATAC-seq consensus peak re-quantification pattern: [Yan et al. Genome Biology 2020](https://link.springer.com/article/10.1186/s13059-020-1929-3)
