Summary
ORF callers run per sample and produce per-sample ORF lists. For a consistent count matrix, these lists must be merged into a single run-level catalogue before quantification, analogous to merging ATAC-seq peaks into a consensus set before re-quantification.
Blocked by: #166 (ORF-level quantification, which defines the ORF catalogue format). Part of --extended_orf_analysis.
Merging strategy
Canonical ORFs (>100 aa)
Two-path approach to handle spliced ORFs:
- Annotated multi-exon CDS ORFs: collapse by shared transcript ID across samples. These are already uniquely identified; coordinate merging would incorrectly merge ORFs sharing genomic span but with different intron structures.
- Single-exon novel intergenic ORFs (class u): coordinate-based, strand-aware merge:
# -c 6 -o distinct keeps the strand column, which -s alone does not report
cat sample*.bed | sort -k1,1 -k2,2n | bedtools merge -i - -s -c 6 -o distinct > merged_canonical.bed
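For the annotated path, collapsing by transcript ID could look roughly like the sketch below. The input shape (pairs of sample name and transcript ID) and the function name are assumptions made for illustration; the actual catalogue format is defined in #166.

```python
from collections import defaultdict


def collapse_by_transcript(calls):
    """Group per-sample calls of annotated ORFs by transcript ID.

    calls: iterable of (sample, transcript_id) pairs (hypothetical input shape).
    Returns {transcript_id: sorted list of samples in which it was called}.
    """
    samples_by_tx = defaultdict(set)
    for sample, transcript_id in calls:
        samples_by_tx[transcript_id].add(sample)
    return {tx: sorted(samples) for tx, samples in samples_by_tx.items()}


# Two isoforms can share a genomic span; keying on the transcript ID keeps them apart.
calls = [
    ("sample1", "TX_A"),
    ("sample2", "TX_A"),
    ("sample2", "TX_B"),  # same locus, different intron structure
]
catalogue = collapse_by_transcript(calls)
```

Because the key is the transcript ID rather than the genomic interval, TX_A and TX_B stay separate entries even if their spans are identical.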
Near-identical ORFs with caller-specific TIS uncertainty: collapse using 99% amino acid identity + shared stop codon — the criterion used by the GENCODE Ribo-seq ORF catalogue (Ruiz-Orera et al. Nat Biotechnol 2022). Not a nucleotide-window heuristic.
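A minimal sketch of the stop-codon plus identity collapse, under stated assumptions: ORFs are grouped by (chrom, strand, stop coordinate), and identity is computed by right-aligning the two peptides without gaps and dividing matches by the longer length. That identity definition, the record keys, and the greedy longest-first clustering are illustrative choices, not the exact procedure of the GENCODE catalogue.

```python
from collections import defaultdict


def cterm_identity(a, b):
    """Fraction of the longer peptide matched when both are right-aligned (no gaps)."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(reversed(a), reversed(b)))
    return matches / longer


def collapse_near_identical(orfs, min_identity=0.99):
    """Cluster ORFs that share a stop codon and are near-identical at the protein level.

    orfs: list of dicts with keys id, chrom, strand, stop, aa (hypothetical schema).
    Returns {representative_id: [member ids, representative first]}.
    """
    by_stop = defaultdict(list)
    for orf in orfs:
        by_stop[(orf["chrom"], orf["strand"], orf["stop"])].append(orf)

    clusters = {}
    for group in by_stop.values():
        representatives = []
        # Longest first, so the longest ORF of each cluster becomes its representative.
        for orf in sorted(group, key=lambda o: (-len(o["aa"]), o["id"])):
            for rep in representatives:
                if cterm_identity(rep["aa"], orf["aa"]) >= min_identity:
                    clusters[rep["id"]].append(orf["id"])
                    break
            else:
                representatives.append(orf)
                clusters[orf["id"]] = [orf["id"]]
    return clusters
```

With this definition, a 300 aa ORF and a copy of it called with a two-residue N-terminal extension score 300/302 and collapse, while a 100 aa ORF ending at the same stop stays separate.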
Small ORFs (≤100 aa — the smORF threshold)
The 100 aa / 300 nt cutoff is the field standard: sORFs.org (NAR 2016), GENCODE Ribo-seq catalogue, SmProt v2. After coordinate merge, apply sequence-based deduplication — smORFs in repetitive regions may be called at multiple genomic copies representing the same peptide:
# CD-HIT at 100% identity
cd-hit -i merged_smorfs.faa -o deduped.faa -c 1.00 -n 5 -l 10 -M 4000 -T 8
# Or MMseqs2 (faster for large catalogues; --cluster-mode 2 for short peptides)
mmseqs easy-cluster merged_smorfs.faa result tmp/ --min-seq-id 1.0 --cov-mode 0 -c 1.0 --cluster-mode 2
Recurrence filter (optional)
Require each ORF to be called in ≥2 samples (or ≥1 per condition) before catalogue inclusion. This reduces single-sample false positives without requiring a hard intersection across all samples.
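The filter could be sketched as follows. The function name and the (ORF ID, sample) input shape are assumptions, and "≥1 per condition" is read here as "called in at least one sample of every condition", which is only one possible interpretation.

```python
from collections import defaultdict


def recurrent_orfs(calls, sample_condition=None, min_samples=2):
    """Return the set of ORF IDs that pass the recurrence filter.

    calls: iterable of (orf_id, sample) pairs after catalogue merging.
    sample_condition: optional {sample: condition}; if given, an ORF must be
        called in at least one sample of every condition instead.
    """
    samples_by_orf = defaultdict(set)
    for orf_id, sample in calls:
        samples_by_orf[orf_id].add(sample)

    if sample_condition is None:
        return {orf for orf, s in samples_by_orf.items() if len(s) >= min_samples}

    all_conditions = set(sample_condition.values())
    return {
        orf
        for orf, s in samples_by_orf.items()
        if {sample_condition[x] for x in s} == all_conditions
    }
```

Counting distinct samples per ORF, rather than intersecting the per-sample lists, keeps ORFs that are recurrent but absent from some samples.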
Re-quantification
All samples re-quantified against the merged catalogue (issue #166). Samples where an ORF was not called have zero counts — zero is a valid observation, not missing data.
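As an illustration of the zero-fill rule, a matrix builder might look like the sketch below. The sparse {(orf_id, sample): count} input is a hypothetical shape; the real quantification output is defined in #166.

```python
def build_count_matrix(catalogue_ids, samples, counts):
    """Build a dense ORF x sample matrix with explicit zeros.

    catalogue_ids: ORF IDs in the merged catalogue (row order).
    samples: sample names (column order).
    counts: {(orf_id, sample): read_count}, containing only non-zero entries.
    Returns a list of rows: [orf_id, count_for_sample_1, count_for_sample_2, ...].
    """
    return [
        [orf_id] + [counts.get((orf_id, sample), 0) for sample in samples]
        for orf_id in catalogue_ids
    ]


matrix = build_count_matrix(
    ["orf1", "orf2"],
    ["s1", "s2"],
    {("orf1", "s1"): 12, ("orf2", "s2"): 3},
)
```

Every catalogue ORF gets a row for every sample, so downstream tools see a 0 rather than a missing value where an ORF was not called.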
Cross-tool consensus: fuzzy overlap required
Benchmark finding (May 2026): strict coordinate-equality cross-tool Jaccard = 0.000–0.057 due to alt-start ambiguity — tools assign different start coordinates to the same biological ORF. Do not use strict (chrom, start, end, strand) equality joins for cross-tool merging. Use bedtools-style fuzzy overlap (≥80% reciprocal overlap) or gene-level aggregation.
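The reciprocal-overlap test can be sketched as below, assuming BED-style 0-based half-open intervals and a strand-aware comparison. It works on genomic spans only; for spliced ORFs the overlap would need to be summed over exon blocks. On the command line, `bedtools intersect -s -f 0.8 -r` applies the same kind of criterion.

```python
def reciprocal_overlap(a, b, min_frac=0.8):
    """True if a and b overlap by at least min_frac of BOTH interval lengths.

    a, b: (chrom, start, end, strand) tuples, 0-based half-open as in BED.
    """
    if a[0] != b[0] or a[3] != b[3]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return overlap >= min_frac * (a[2] - a[1]) and overlap >= min_frac * (b[2] - b[1])


# Same stop, start called 30 nt apart by two tools: unequal tuples, same ORF.
tool_a = ("chr1", 100, 400, "+")
tool_b = ("chr1", 130, 400, "+")
same_orf = reciprocal_overlap(tool_a, tool_b)
```

A strict equality join would treat tool_a and tool_b as different ORFs, whereas the reciprocal test pairs them (270 nt shared, 90% of the longer interval and 100% of the shorter).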
Output format
Per-tool output harmonisation needed before merging:
- RiboCode: -b flag → BED12 (cleanest)
- Ribo-TISH: --blocks → BED-like block coordinates
- These need a normalisation step before catalogue merge
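One possible shape for the normalisation step is a helper that turns a list of exon blocks into a BED12 line, sketched below. The function name and the input convention (0-based half-open block coordinates) are assumptions; each caller's actual block format and coordinate system would need to be checked and converted by a tool-specific parser first.

```python
def blocks_to_bed12(chrom, strand, name, blocks):
    """Format an ORF given as exon blocks into a single BED12 line.

    blocks: list of (start, end) genomic segments, 0-based half-open.
    thickStart/thickEnd are set to the full span because the whole ORF is coding.
    """
    blocks = sorted(blocks)
    start, end = blocks[0][0], blocks[-1][1]
    block_sizes = ",".join(str(e - s) for s, e in blocks)
    block_starts = ",".join(str(s - start) for s, _ in blocks)  # relative to chromStart
    fields = [
        chrom, start, end, name, 0, strand,
        start, end, "0,0,0", len(blocks), block_sizes, block_starts,
    ]
    return "\t".join(str(f) for f in fields)


line = blocks_to_bed12("chr1", "+", "orf1", [(100, 200), (300, 450)])
```

Keeping the block structure in BED12, rather than flattening to a span, preserves the intron information that distinguishes spliced ORFs sharing the same genomic interval.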
Outputs:
- orf_catalogue.gtf / orf_catalogue.bed12
- orf_catalogue.faa (amino acid sequences for sequence deduplication)
- orf_to_gene.tsv (ORF ID → host gene ID, for gene-level pre-aggregation in issue #167)
Reference implementation
gencode-riboseqORFs, the tool used to build the GENCODE Phase I Ribo-seq ORF catalogue, implements coordinate + sequence-based collapse and can serve as a reference or be wrapped as a local module.
References