iraiosub/nf-mfe

Chimeric Interaction Sequence Extraction and MFE Computation Pipeline

This Nextflow pipeline processes chimeric interaction tables by:

  1. Splitting large tables into chunks of --chunk_size rows (default: 10,000), preserving the header in each chunk
  2. Extracting FASTA sequences for left and right coordinates using bedtools
  3. Adding extracted sequences as new columns (lseq, rseq)
  4. Performing MFE calculation on the extracted sequences (RNAduplex)
  5. Optionally generating controls for MFE comparison: shuffled sequences (uShuffle) and/or a flipped-arm control (one arm's sequence reversed)
  6. Concatenating all chunks back into a single table with one header and plotting
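The header-preserving split in step 1 can be sketched in a few lines of Python. This is only an illustration of the idea, not the pipeline's SPLIT_TABLE code; the function name split_with_header is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_with_header(lines: Iterable[str], chunk_size: int = 10000) -> Iterator[List[str]]:
    """Yield chunks of at most `chunk_size` data rows, each prefixed with the header line."""
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return  # empty input: nothing to yield
    while True:
        rows = list(islice(it, chunk_size))
        if not rows:
            break
        # Every chunk carries its own copy of the header so it can be processed alone.
        yield [header] + rows


# Tiny illustrative table: one header line and three data rows.
table = ["name\tvalue", "a\t1", "b\t2", "c\t3"]
chunks = list(split_with_header(table, chunk_size=2))
for chunk in chunks:
    print(chunk)
```

Because every chunk repeats the header, each one can be handled independently and merged later by dropping all but the first header.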

Pipeline diagram

Requirements

  • Nextflow >= 25.10.2
  • Docker or Singularity

Quick Start

Important: input coordinates are assumed to be BED-style (0-based, half-open).
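If your coordinates come from a 1-based, inclusive source (for example GFF/GTF-style positions), they need converting first. A minimal Python sketch of the convention; both helper names are hypothetical and not part of the pipeline:

```python
from typing import Tuple


def one_based_to_bed(start_1based: int, end_1based: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive interval to a 0-based half-open (BED-style) interval."""
    return start_1based - 1, end_1based


def bed_length(start: int, end: int) -> int:
    """Length of a 0-based half-open interval [start, end)."""
    return end - start


# The first ten bases of a chromosome: 1..10 (1-based inclusive) is [0, 10) in BED.
start, end = one_based_to_bed(1, 10)
print(start, end, bed_length(start, end))
```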

Option 1: Using a samplesheet

nextflow run main.nf \
  --input samplesheet.tsv \
  --fasta /path/to/genome.fa \
  --fai /path/to/genome.fa.fai \
  --chunk_size <n> \
  --shuffled_mfe \
  --n_shuffles <n> \
  --flipped_arm_mfe \
  --outdir results

Samplesheet format (TSV with header):

sample_id	file_path
GM12878_rep1	/path/to/GM12878_rep1.chim.txt
GM12878_rep2	/path/to/GM12878_rep2.chim.txt
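A samplesheet can be sanity-checked before launching the run. The Python sketch below is a hypothetical helper, not part of the pipeline. It checks for the two header columns shown above, and it additionally assumes that sample_id values should be unique, which the README does not state.

```python
import csv
import io
from typing import List, Tuple


def read_samplesheet(handle) -> List[Tuple[str, str]]:
    """Parse a TSV samplesheet into (sample_id, file_path) pairs."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"sample_id", "file_path"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"samplesheet is missing columns: {sorted(missing)}")
    rows = [(r["sample_id"], r["file_path"]) for r in reader]
    ids = [sample_id for sample_id, _ in rows]
    # Assumption: duplicate sample IDs are treated as an error.
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate sample_id values in samplesheet")
    return rows


sheet = io.StringIO(
    "sample_id\tfile_path\n"
    "GM12878_rep1\t/path/to/GM12878_rep1.chim.txt\n"
    "GM12878_rep2\t/path/to/GM12878_rep2.chim.txt\n"
)
samples = read_samplesheet(sheet)
print(samples)
```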

Option 2: Using a glob pattern

nextflow run main.nf \
  --input_pattern 'data/*.chim.txt' \
  --fasta /path/to/genome.fa \
  --outdir results \
  <...>

Parameters

Required

  • --fasta: Path to reference genome FASTA file
  • --fai: Path to reference genome FASTA index

Input (choose one)

  • --input: Path to samplesheet (TSV format)
  • --input_pattern: Glob pattern for input files (e.g., "data/*.txt")

Optional

  • --outdir: Output directory (default: './results')
  • --chunk_size: Number of rows per chunk (default: 10000)
  • --shuffled_mfe: Enable MFE calculations for shuffled control sequences (default: false)
  • --n_shuffles: Number of shuffles performed per interaction (default: 100)
  • --klet_shuffles: k-let size preserved by uShuffle when shuffling (default: 2, i.e. dinucleotide counts are preserved)
  • --flipped_arm_mfe: Enable MFE calculations for flipped arm control sequences
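To make the flipped-arm control concrete: one arm is reversed and paired with the other arm unchanged. The Python sketch below assumes plain string reversal (no complementing), following the README's wording; the function name is hypothetical, and the `&` separator matches the format of the output pair columns.

```python
from typing import Dict


def flipped_pairs(lseq: str, rseq: str) -> Dict[str, str]:
    """Build the two flipped-arm sequence pairs, joined with '&'."""
    return {
        # Reverse the left arm only, keep the right arm as is.
        "flipped_lseq_pair": f"{lseq[::-1]}&{rseq}",
        # Reverse the right arm only, keep the left arm as is.
        "flipped_rseq_pair": f"{lseq}&{rseq[::-1]}",
    }


pairs = flipped_pairs("GGGA", "UCCC")
print(pairs)
```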

Input File Format

Input files should be tab-delimited with the following required columns:

lchr	ll	lr	lstrand	rchr	rl	rr	rstrand	name

Optional columns:

mapq
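For orientation, here is how one input row could map onto the two strand-aware BED records that `bedtools getfasta -s` expects (strand in column 6). This Python sketch is not the pipeline's PREPARE_BED code. It assumes ll/lr and rl/rr are the start/end of the left and right arms, and it uses "0" as a placeholder score.

```python
from typing import Dict, Tuple

REQUIRED = ["lchr", "ll", "lr", "lstrand", "rchr", "rl", "rr", "rstrand", "name"]


def row_to_bed6(row: Dict[str, str]) -> Tuple[str, str]:
    """Return (left, right) BED6 lines for one chimeric interaction row."""
    missing = [c for c in REQUIRED if c not in row]
    if missing:
        raise KeyError(f"missing required columns: {missing}")

    def bed6(chrom: str, start: str, end: str, strand: str) -> str:
        # BED6 layout: chrom, start, end, name, score (placeholder), strand
        return "\t".join([chrom, str(start), str(end), row["name"], "0", strand])

    left = bed6(row["lchr"], row["ll"], row["lr"], row["lstrand"])
    right = bed6(row["rchr"], row["rl"], row["rr"], row["rstrand"])
    return left, right


# Made-up example row, for illustration only.
example = dict(zip(REQUIRED, ["chr1", "100", "120", "+", "chr2", "500", "525", "-", "read_1"]))
left_bed, right_bed = row_to_bed6(example)
print(left_bed)
print(right_bed)
```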

Output

For each sample, the pipeline produces:

  • {sample_id}_mfe.tsv: Final table with added columns

Output columns include all original columns plus:

  • lseq: Extracted sequence for left coordinate (strand-aware)

  • rseq: Extracted sequence for right coordinate (strand-aware)

  • mfe: Observed minimum free energy, MFE (kcal/mol)

  • dot_bracket: Dot-bracket representation of the predicted duplex structure for the observed lseq and rseq

  • mean_shuffled_mfe: Mean duplex MFE across all successfully evaluated shuffled sequence pairs

  • sd_shuffled_mfe: Standard deviation of duplex MFE across all successfully evaluated shuffled sequence pairs

  • delta_mfe: Observed MFE minus the mean shuffled MFE

  • zscore_mfe: Standardized difference between the observed MFE and the shuffled MFE distribution

  • empirical_p_lower: Empirical one-sided p-value for observing an MFE this low or lower relative to the shuffled null distribution

  • n_shuffles_ok: Number of shuffled sequence pairs for which duplex MFE was successfully computed

  • mfe_lseq_flipped: Duplex MFE after reversing lseq only and folding it against the original rseq

  • flipped_lseq_dot_bracket: Dot-bracket structure for the duplex formed by reversed lseq and original rseq

  • flipped_lseq_pair: The exact sequence pair used for that fold, formatted as reversed_lseq&original_rseq

  • mfe_rseq_flipped: Duplex MFE after reversing rseq only and folding it against the original lseq

  • flipped_rseq_dot_bracket: Dot-bracket structure for the duplex formed by original lseq and reversed rseq

  • flipped_rseq_pair: The exact sequence pair used for that fold, formatted as original_lseq&reversed_rseq
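The shuffled-control summary columns relate to each other roughly as in the Python sketch below. The exact formulas used by the pipeline are not given here, so two choices are assumptions: the sample standard deviation (n - 1 denominator), and an empirical p-value with a +1 pseudocount. The function name and the example values are made up for illustration.

```python
import math
import statistics
from typing import Dict, List


def shuffle_stats(observed: float, shuffled: List[float]) -> Dict[str, float]:
    """Summarise an observed MFE against MFEs from shuffled sequence pairs."""
    n = len(shuffled)
    mean = statistics.fmean(shuffled)
    # Assumption: sample standard deviation; undefined for fewer than 2 values.
    sd = statistics.stdev(shuffled) if n > 1 else math.nan
    delta = observed - mean
    z = delta / sd if sd and not math.isnan(sd) else math.nan
    # Assumption: one-sided p-value with pseudocount, so it is never exactly 0.
    p_lower = (1 + sum(1 for s in shuffled if s <= observed)) / (n + 1)
    return {
        "mean_shuffled_mfe": mean,
        "sd_shuffled_mfe": sd,
        "delta_mfe": delta,
        "zscore_mfe": z,
        "empirical_p_lower": p_lower,
        "n_shuffles_ok": n,
    }


stats = shuffle_stats(-10.0, [-4.0, -6.0, -5.0, -3.0, -7.0])
print(stats)
```

A negative delta_mfe or zscore_mfe means the observed duplex is more stable (lower MFE) than the shuffled controls.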

Execution Profiles

Docker

nextflow run main.nf -profile docker --input ... --fasta ...

Singularity

nextflow run main.nf -profile singularity --input ... --fasta ...

Pipeline Overview

Input Tables
    ↓
SPLIT_TABLE (--chunk_size rows per chunk; default 10k)
    ↓
PREPARE_BED (convert to BED format)
    ↓
EXTRACT_SEQUENCES (bedtools getfasta -s)
    ↓
ADD_SEQUENCES (add lseq, rseq columns)
    ↓
CALCULATE_MFE or CALCULATE_MFE_CONTROLS (MFE-related columns)
    ↓
CONCATENATE_TABLES (merge chunks)
    ↓
PLOT_MFE_SUMMARY (per sample)
    ↓
Final Output

Example

# Using samplesheet
nextflow run main.nf \
  --input samplesheet.tsv \
  --fasta hg38.fa \
  --fai hg38.fa.fai \
  --chunk_size 10000 \
  --outdir results \
  -profile docker

Notes to self

Settings for local testing

conda activate nfcore_tools_34

nextflow run main.nf \
  --fasta /Volumes/lab-ulej/home/users/luscomben/users/iosubi/projects/structurome_blencowe/ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --fai /Volumes/lab-ulej/home/users/luscomben/users/iosubi/projects/structurome_blencowe/ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai \
  --input ./data/samplesheet.tsv \
  --outdir results \
  --chunk_size 10 \
  --shuffled_mfe \
  --n_shuffles 5 \
  --flipped_arm_mfe \
  -profile docker \
  -resume

Data flow with chunk matching

SPLIT_TABLE → [meta, chunk_0001.txt]
              [meta, chunk_0002.txt]
              [meta, chunk_0003.txt]
                     ↓
PREPARE_BED → [meta, chunk_0001.txt, chunk_0001_left.bed, chunk_0001_right.bed]
              [meta, chunk_0002.txt, chunk_0002_left.bed, chunk_0002_right.bed]
              [meta, chunk_0003.txt, chunk_0003_left.bed, chunk_0003_right.bed]
                     ↓
EXTRACT_SEQUENCES → [meta, chunk_0001.txt, left.fa, right.fa]
                    [meta, chunk_0002.txt, left.fa, right.fa]
                    [meta, chunk_0003.txt, left.fa, right.fa]
                     ↓
ADD_SEQUENCES → [meta, chunk_0001_sequences.tsv]
                [meta, chunk_0002_sequences.tsv]
                [meta, chunk_0003_sequences.tsv]

etc.
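The final merge step (grouping processed chunks by sample and keeping a single header) can be sketched as follows. This is a plain-Python illustration, not the Nextflow channel logic or the CONCATENATE_TABLES process. It assumes that chunks can be ordered by file name; the function name and example data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def concatenate_chunks(chunks: Iterable[Tuple[str, str, List[str]]]) -> Dict[str, List[str]]:
    """Merge (sample_id, chunk_name, lines) tuples into one table per sample."""
    grouped = defaultdict(list)
    for sample_id, chunk_name, lines in chunks:
        grouped[sample_id].append((chunk_name, lines))

    merged = {}
    for sample_id, items in grouped.items():
        out: List[str] = []
        # Assumption: zero-padded chunk names sort into the original row order.
        for i, (_, lines) in enumerate(sorted(items, key=lambda item: item[0])):
            # Keep the header from the first chunk only.
            out.extend(lines if i == 0 else lines[1:])
        merged[sample_id] = out
    return merged


processed = [
    ("s1", "chunk_0002_mfe.tsv", ["name\tmfe", "c\t-3.0"]),
    ("s1", "chunk_0001_mfe.tsv", ["name\tmfe", "a\t-1.0", "b\t-2.0"]),
    ("s2", "chunk_0001_mfe.tsv", ["name\tmfe", "x\t-9.0"]),
]
tables = concatenate_chunks(processed)
print(tables["s1"])
```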

About

A Nextflow DSL2 pipeline to extract sequences from chimeric read tables and compute MFE
