feat: cross-sample ORF catalogue — merge, deduplicate, and re-quantify across all samples #167

@pinin4fjords

Summary

ORF callers run per sample and produce per-sample ORF lists. To obtain a consistent count matrix, these lists must be merged into a single run-level catalogue before quantification, analogous to merging ATAC-seq peaks into a consensus set before re-quantification.

Blocked by: #166 (ORF-level quantification, which defines the ORF catalogue format). Part of --extended_orf_analysis.

Merging strategy

Canonical ORFs (>100 aa)

Two-path approach to handle spliced ORFs:

  • Annotated multi-exon CDS ORFs: collapse by shared transcript ID across samples. These are already uniquely identified, and coordinate merging would incorrectly merge ORFs that share a genomic span but have different intron structures.
  • Single-exon novel intergenic ORFs (class u): coordinate-based, strand-aware merge (-c 6 -o distinct reports the strand column, which bedtools merge -s alone does not output in current releases):
    cat sample*.bed | sort -k1,1 -k2,2n | bedtools merge -s -c 6 -o distinct -i stdin > merged_canonical.bed
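
The routing between the two paths above could be sketched as follows. This is a minimal illustration, not the pipeline's implementation: it assumes the per-tool outputs have already been harmonised to BED12, that column 10 (blockCount) distinguishes single-exon from multi-exon ORFs, and that column 4 carries the transcript ID for annotated ORFs. The function names are hypothetical.

```shell
# Sketch: route harmonised BED12 records to the two merge paths.
# Assumptions (not from the issue): column 10 is blockCount, column 4 holds
# the transcript ID for annotated ORFs, class-u filtering happened upstream.
split_by_exon_count() {
    # $1: BED12 input; $2: output for single-exon ORFs; $3: output for multi-exon ORFs
    awk -F'\t' -v single="$2" -v multi="$3" '
        BEGIN { printf "" > single; printf "" > multi }  # create/truncate both outputs
        $10 == 1 { print > single; next }                # one block: coordinate-merge path
        { print > multi }                                # several blocks: transcript-ID path
    ' "$1"
}

# Collapse multi-exon ORFs by transcript ID (column 4), keeping the first
# record seen for each ID across all samples. Reads files or stdin.
collapse_by_transcript_id() {
    awk -F'\t' '!seen[$4]++' "$@"
}

# Usage (hypothetical file names):
#   split_by_exon_count all_samples.bed12 single_exon.bed multi_exon.bed
#   collapse_by_transcript_id multi_exon.bed > annotated_collapsed.bed
```

The single-exon file would then go through the bedtools merge command shown above.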

Near-identical ORFs that differ only through caller-specific TIS uncertainty: collapse using 99% amino acid identity plus a shared stop codon, the criterion used by the GENCODE Ribo-seq ORF catalogue (Ruiz-Orera et al. Nat Biotechnol 2022), rather than a nucleotide-window heuristic.
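
As a first step towards that criterion, ORFs can be grouped by stop codon so that only ORFs sharing a stop are compared for amino acid identity. The sketch below covers the grouping only; the identity comparison is left to a separate alignment or clustering step. It assumes BED6-or-wider input with 0-based half-open coordinates, and the function name is hypothetical.

```shell
# Sketch: prefix each BED record with a stop-codon key (chrom:stop:strand).
# Assumption: coordinates are 0-based half-open, so the stop-side boundary is
# column 3 on the + strand and column 2 on the - strand.
add_stop_key() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        {
            stop = ($6 == "-") ? $2 : $3
            print $1 ":" stop ":" $6, $0
        }' "$@"
}

# Usage (hypothetical file name): records with the same first column are
# candidates for collapse once their peptide identity has been checked.
#   add_stop_key merged_canonical.bed | sort -k1,1
```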

Small ORFs (≤100 aa — the smORF threshold)

The 100 aa / 300 nt cutoff is the field standard: sORFs.org (NAR 2016), GENCODE Ribo-seq catalogue, SmProt v2. After coordinate merge, apply sequence-based deduplication — smORFs in repetitive regions may be called at multiple genomic copies representing the same peptide:

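# CD-HIT at 100% identity (identity is computed over the shorter sequence; add -aL 1.0 -aS 1.0 to keep contained fragments separate)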
cd-hit -i merged_smorfs.faa -o deduped.faa -c 1.00 -n 5 -l 10 -M 4000 -T 8

# Or MMseqs2 (faster for large catalogues; --cluster-mode 2 for short peptides)
mmseqs easy-cluster merged_smorfs.faa result tmp/ --min-seq-id 1.0 --cov-mode 0 -c 1.0 --cluster-mode 2
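
For the strict case of byte-identical peptides, the same deduplication can be illustrated without either tool. This is a minimal sketch, not a replacement for CD-HIT or MMseqs2: it assumes a protein FASTA (possibly line-wrapped) with consistent letter case and no trailing stop symbols, and it keeps the first header seen for each distinct sequence. The function name is hypothetical.

```shell
# Sketch: keep one FASTA record per distinct peptide sequence (exact match only).
dedup_exact_peptides() {
    awk '
        # Print the pending record if its sequence has not been seen before.
        function emit() {
            if (id != "" && !(seq in seen)) { seen[seq] = 1; print ">" id; print seq }
        }
        /^>/ { emit(); id = substr($0, 2); seq = ""; next }  # new header: flush previous record
        { seq = seq $0 }                                     # join wrapped sequence lines
        END { emit() }
    ' "$@"
}

# Usage:
#   dedup_exact_peptides merged_smorfs.faa > deduped_exact.faa
```

Output sequences are written on a single line each; headers of dropped duplicates are not retained, so a real implementation would probably also record which genomic copies were collapsed.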

Recurrence filter (optional)

Require each ORF to be called in ≥2 samples (or ≥1 per condition) before catalogue inclusion. This reduces single-sample false positives without requiring a hard intersection across all samples.
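
The sample-count form of this filter could look like the sketch below. It assumes a two-column TSV of catalogue ORF ID and sample name, one row per call, which is a hypothetical intermediate format; the per-condition variant would need a sample-to-condition mapping and is not shown.

```shell
# Sketch: list ORFs called in at least N distinct samples.
recurrent_orfs() {
    # $1: minimum number of distinct samples; remaining args: TSV files (or stdin)
    min="$1"; shift
    awk -F'\t' -v min="$min" '
        # Count each (ORF, sample) pair once, even if a caller reports it repeatedly.
        !(($1, $2) in pair) { pair[$1, $2] = 1; n[$1]++ }
        END { for (orf in n) if (n[orf] >= min) print orf }
    ' "$@" | sort
}

# Usage (hypothetical file name):
#   recurrent_orfs 2 orf_calls.tsv > recurrent_ids.txt
```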

Re-quantification

All samples are re-quantified against the merged catalogue (issue #166). A sample in which an ORF was not called is still quantified for that ORF, and a resulting zero count is a valid observation, not missing data.
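
If the quantifier omits features with no reads, the zeros have to be written explicitly when the matrix is assembled. The following is a minimal sketch under that assumption: it takes a long-format TSV of ORF ID, sample and count (a hypothetical layout, since the actual format is defined in #166) and emits a matrix with 0 for every absent pair.

```shell
# Sketch: pivot long-format counts into an ORF x sample matrix, filling gaps with 0.
counts_to_matrix() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        {
            if (!($1 in orf)) { orf[$1] = 1; orfs[++no] = $1 }  # remember ORF order
            if (!($2 in smp)) { smp[$2] = 1; smps[++ns] = $2 }  # remember sample order
            cnt[$1, $2] += $3
        }
        END {
            line = "orf_id"
            for (j = 1; j <= ns; j++) line = line OFS smps[j]
            print line
            for (i = 1; i <= no; i++) {
                line = orfs[i]
                for (j = 1; j <= ns; j++)
                    line = line OFS (((orfs[i], smps[j]) in cnt) ? cnt[orfs[i], smps[j]] : 0)
                print line
            }
        }' "$@"
}

# Usage (hypothetical file name):
#   counts_to_matrix orf_counts_long.tsv > orf_count_matrix.tsv
```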

Cross-tool consensus: fuzzy overlap required

Benchmark finding (May 2026): strict coordinate-equality cross-tool Jaccard = 0.000–0.057 due to alt-start ambiguity — tools assign different start coordinates to the same biological ORF. Do not use strict (chrom, start, end, strand) equality joins for cross-tool merging. Use bedtools-style fuzzy overlap (≥80% reciprocal overlap) or gene-level aggregation.
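
To make the reciprocal-overlap rule concrete, here is a dependency-free sketch that pairs ORFs from two tools when they share chromosome and strand and the overlap covers at least the given fraction of both intervals. It compares genomic spans from BED6 input and runs in quadratic time, so it is an illustration only; in practice the same test is usually written as bedtools intersect -a A.bed -b B.bed -s -f 0.8 -r -wa -wb, and spliced ORFs would need exon-aware handling. The function name and file names are hypothetical.

```shell
# Sketch: report (tool A ID, tool B ID) pairs with >= frac reciprocal overlap.
reciprocal_overlap_pairs() {
    # $1: BED6 from tool A; $2: BED6 from tool B; $3: minimum fraction (default 0.8)
    awk -F'\t' -v frac="${3:-0.8}" 'BEGIN { OFS = "\t" }
        NR == FNR {                      # first file: load tool A intervals
            c[FNR] = $1; s[FNR] = $2 + 0; e[FNR] = $3 + 0
            id[FNR] = $4; st[FNR] = $6; n = FNR
            next
        }
        {                                # second file: compare each tool B interval
            bs = $2 + 0; be = $3 + 0
            for (i = 1; i <= n; i++) {
                if (c[i] != $1 || st[i] != $6) continue
                lo = (s[i] > bs) ? s[i] : bs
                hi = (e[i] < be) ? e[i] : be
                ov = hi - lo
                if (ov > 0 && ov >= frac * (e[i] - s[i]) && ov >= frac * (be - bs))
                    print id[i], $4
            }
        }' "$1" "$2"
}

# Usage:
#   reciprocal_overlap_pairs ribocode.bed ribotish.bed 0.8 > matched_pairs.tsv
```

Two calls for the same ORF that differ by a 30 nt start shift would fail a strict equality join but pass this test.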

Output format

Per-tool output harmonisation needed before merging:

  • RiboCode: -b flag → BED12 (cleanest)
  • Ribo-TISH: --blocks → BED-like block coordinates
  • These need a normalisation step before catalogue merge

Outputs:

Reference implementation

gencode-riboseqORFs — the tool used to build the GENCODE Phase I Ribo-seq ORF catalogue — implements coordinate + sequence-based collapse and can serve as a reference or be wrapped as a local module.
