feat: add Rp-Bp as Tier-1 ORF caller (opt-in, overnight mode) #169

@pinin4fjords

Summary

Add Rp-Bp as a Tier-1, opt-in ORF caller, activated by --run_rpbp (default: false). It was benchmarked alongside RiboCode, and the two form the recommended default two-caller combination: RiboCode is permissive and fast, while Rp-Bp is Bayesian, strict, and slow.

Blocked by: #163 (ribotricer removal must complete first).

Benchmark results

From the nf-core/riboseq benchmark (Krueger + García Bediaga, May 2026; 6 biological replicates, genome-wide):

  • Mean Spearman replicate concordance: 0.893 (Tier-1)
  • Mean Jaccard set overlap: 0.673 (Tier-1)
  • uORF/dORF ratio: 3.4–5.2× (consistent with biological expectation)
  • Start-codon composition: 99.94% AUG
  • ORF volume: ~18k–26k per replicate (comparable to RiboCode's ~22k)
  • Runtime: 19h38m–24h11m per replicate (Bayesian MCMC fit dominates)
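
For readers who want to reproduce the set-overlap figure, the sketch below computes a mean pairwise Jaccard index across replicates. The issue does not state how the benchmark keyed ORFs or averaged the pairs, so the ORF identifier (any hashable value, for example a coordinate tuple) and the averaging over all replicate pairs are both assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections of hashable ORF identifiers."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty call sets are treated as identical (assumption).
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(replicates):
    """Average the Jaccard index over every unordered pair of replicates."""
    pairs = list(combinations(replicates, 2))
    if not pairs:
        raise ValueError("need at least two replicates")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy ORF identifiers; real input would be one call set per replicate.
    reps = [{"orf1", "orf2", "orf3"}, {"orf2", "orf3", "orf4"}]
    print(mean_pairwise_jaccard(reps))
```

With six replicates this averages over 15 pairs; whether the reported 0.673 was computed this way is not confirmed by the issue.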

Software

  • Bioconda: rpbp=4.0.1 (released 2024-11-21) — conda install -c bioconda rpbp
  • PyPI: rpbp==4.0.1 (released 2024-11-20), Python 3.11–3.13
  • GitHub: dieterich-lab/rp-bp, actively maintained (Etienne Boileau), not archived

Implementation

  1. New nf-core module at modules/nf-core/rpbp/. Rp-Bp has two phases:

    • prepare-rpbp-genome (index build, run once per pipeline execution): takes genome FASTA + GTF, outputs an Rp-Bp config YAML and genome index
    • rpbp (per-sample run): takes BAM + config YAML, outputs BED/CSV of predicted ORFs with Bayes factor scores
  2. Pipeline wiring in workflows/riboseq/main.nf:

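    if (params.run_rpbp) {
        // Build the Rp-Bp index and config once per pipeline execution
        RPBP_PREPARE_GENOME(ch_fasta, ch_gtf)
        // Call ORFs per sample against the shared config
        RPBP(ch_genome_bam, RPBP_PREPARE_GENOME.out.config)
    }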
  3. --rpbp_config parameter for the required YAML configuration file (genome paths, read-length range, minimum read count per length class). Provide a helper script or pipeline step to auto-generate this from pipeline inputs rather than requiring manual YAML authoring.

  4. Multi-caller merge: update the ORF caller output aggregation (issue #165, "feat: wire hybrid GTF (canonical + novel intergenic) into ORF callers") to include Rp-Bp calls when --run_rpbp is active.

  5. Documentation: state the expected genome-wide runtime of ~24 h per replicate prominently so users can schedule accordingly.
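
As a starting point for the helper script mentioned in item 3, the sketch below assembles a flat config from pipeline inputs and writes it as YAML without extra dependencies. The function names are hypothetical, and the key names (genome_name, fasta, gtf, genome_base_path, example_lengths) are assumptions for illustration; they should be checked against the Rp-Bp documentation for the installed version.

```python
import json
from pathlib import Path


def build_rpbp_config(fasta, gtf, genome_name, out_dir, extra=None):
    """Assemble a flat config mapping from pipeline inputs.

    Key names are assumptions for illustration, not a confirmed Rp-Bp schema.
    """
    config = {
        "genome_name": genome_name,
        "fasta": str(Path(fasta)),
        "gtf": str(Path(gtf)),
        "genome_base_path": str(Path(out_dir)),
    }
    # Optional tuning values (read-length range, minimum read counts, ...)
    # are passed through as given rather than hard-coded here.
    config.update(extra or {})
    return config


def to_yaml(config):
    """Serialise a flat mapping of scalars or scalar lists as YAML text."""
    # JSON scalars and flow-style lists are also valid YAML, so json.dumps
    # handles the quoting for each value.
    lines = [f"{key}: {json.dumps(value)}" for key, value in config.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    cfg = build_rpbp_config(
        fasta="genome.fa",
        gtf="annotation.gtf",
        genome_name="GRCh38",
        out_dir="rpbp_index",
        extra={"example_lengths": [28, 29, 30]},  # hypothetical key
    )
    print(to_yaml(cfg), end="")
```

Emitting the file from pipeline inputs keeps --rpbp_config optional for most users, while a manually authored YAML can still override it.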

Notes

  • The check_programs_exist step in prepare-rpbp-genome requires STAR on PATH even when no alignment is run; account for this in the Nextflow process environment.
  • Memory: comfortable with default Python/Stan settings at genome-wide scale (no -Xmx flags needed, unlike PRICE).
  • The Bayesian fit is the bottleneck, not I/O — runtime scales with read depth, not genome complexity.
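
The STAR requirement noted above can be checked before the slow steps start. The sketch below is a minimal pre-flight check using the standard library; the function name is hypothetical, and it only mirrors the idea of a program-existence check rather than reproducing Rp-Bp's own logic.

```python
import shutil


def missing_programs(programs, path=None):
    """Return the names in `programs` that cannot be found on PATH."""
    return [name for name in programs if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # "STAR" comes from the note above; extend the list as needed.
    missing = missing_programs(["STAR"])
    if missing:
        print(f"Missing required programs: {', '.join(missing)}")
    else:
        print("All required programs found on PATH")
```

Running a check like this inside the process container would surface a missing STAR binary in seconds instead of partway through index preparation.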

Labels

enhancement (New feature or request)