Chimeric Interaction Sequence Extraction and MFE Computation Pipeline

This Nextflow pipeline processes chimeric interaction tables by:

Splitting large tables into chunks of 10,000 rows by default (preserving headers)
Extracting FASTA sequences for left and right coordinates using bedtools
Adding extracted sequences as new columns (lseq, rseq)
Performing MFE calculation on the extracted sequences (RNAduplex)
Optionally generating control sequences for MFE comparisons: shuffled sequences (uShuffle) and/or a flipped-arm control (reversing one arm's sequence)
Concatenating all chunks back into a single table with one header, then plotting an MFE summary per sample
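
The split and concatenate steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the pipeline's own code: the function names `split_table` and `concatenate_chunks` are hypothetical, and the table is assumed to be held in memory as a list of lines whose first entry is the header.

```python
from typing import Iterable, Iterator, List


def split_table(lines: List[str], chunk_size: int = 10000) -> Iterator[List[str]]:
    """Yield chunks of at most chunk_size data rows, each starting with the header."""
    if not lines:
        return
    header, rows = lines[0], lines[1:]
    for start in range(0, len(rows), chunk_size):
        yield [header] + rows[start:start + chunk_size]


def concatenate_chunks(chunks: Iterable[List[str]]) -> List[str]:
    """Merge chunks back into one table, keeping only the first header."""
    merged: List[str] = []
    for i, chunk in enumerate(chunks):
        # Every chunk carries its own header; skip it for all but the first.
        merged.extend(chunk if i == 0 else chunk[1:])
    return merged


if __name__ == "__main__":
    table = ["name\tvalue"] + [f"row{i}\t{i}" for i in range(25)]
    chunks = list(split_table(table, chunk_size=10))
    print(len(chunks))                          # number of chunks
    print(concatenate_chunks(chunks) == table)  # round trip restores the table
```

Because every chunk carries its own header, each downstream step can process a chunk as a complete table, and the final merge only has to drop the repeated header lines.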

Requirements

Nextflow >= 25.10.2
Docker or Singularity

Quick Start

Important: input coordinates are assumed to be BED-style, i.e. 0-based and half-open.
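
As a quick reminder of what that convention means: in a 0-based half-open interval the start is included and the end is excluded, so the length is simply end minus start. The helper names below are hypothetical and only illustrate the arithmetic, including the conversion you would need if your table uses 1-based inclusive coordinates instead.

```python
from typing import Tuple


def interval_length(start: int, end: int) -> int:
    """Length of a 0-based half-open interval [start, end)."""
    if start < 0 or end < start:
        raise ValueError(f"invalid BED interval: start={start}, end={end}")
    return end - start


def one_based_to_bed(start_1based: int, end_1based: int) -> Tuple[int, int]:
    """Convert 1-based inclusive coordinates to 0-based half-open ones."""
    # Only the start shifts by one; the inclusive end equals the exclusive end.
    return start_1based - 1, end_1based


if __name__ == "__main__":
    start, end = one_based_to_bed(101, 120)
    print(start, end, interval_length(start, end))
```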

Option 1: Using a samplesheet

nextflow run main.nf \
  --input samplesheet.tsv \
  --fasta /path/to/genome.fa \
  --fai /path/to/genome.fa.fai \
  --chunk_size <n> \
  --shuffled_mfe \
  --n_shuffles <n> \
  --flipped_arm_mfe \
  --outdir results

Samplesheet format (TSV with header):

sample_id	file_path
GM12878_rep1	/path/to/GM12878_rep1.chim.txt
GM12878_rep2	/path/to/GM12878_rep2.chim.txt
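
If you want to sanity-check a samplesheet before launching a run, a small script along these lines may help. It is a sketch that assumes only the two columns shown above (`sample_id`, `file_path`); the function name `read_samplesheet` is hypothetical and is not part of the pipeline.

```python
import csv
import io
from typing import List, TextIO, Tuple


def read_samplesheet(handle: TextIO) -> List[Tuple[str, str]]:
    """Return (sample_id, file_path) pairs from a tab-separated samplesheet."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"sample_id", "file_path"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"samplesheet is missing columns: {sorted(missing)}")
    return [(row["sample_id"], row["file_path"]) for row in reader]


if __name__ == "__main__":
    example = io.StringIO(
        "sample_id\tfile_path\n"
        "GM12878_rep1\t/path/to/GM12878_rep1.chim.txt\n"
    )
    for sample_id, path in read_samplesheet(example):
        print(sample_id, path)
```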

Option 2: Using a glob pattern

nextflow run main.nf \
  --input_pattern 'data/*.chim.txt' \
  --fasta /path/to/genome.fa \
  --outdir results \
  <...>

Parameters

Required

--fasta: Path to reference genome FASTA file
--fai: Path to reference genome FASTA index

Input (choose one)

--input: Path to samplesheet (TSV format)
--input_pattern: Glob pattern for input files (e.g., "data/*.txt")

Optional

--outdir: Output directory (default: './results')
--chunk_size: Number of rows per chunk (default: 10000)
--shuffled_mfe: Enable MFE calculations for shuffled control sequences (default: false)
--n_shuffles: Number of shuffles generated per sequence (default: 100)
--klet_shuffles: k-let size preserved by uShuffle when shuffling (default: 2, which preserves dinucleotide counts)
--flipped_arm_mfe: Enable MFE calculations for flipped arm control sequences
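
To make the flipped arm control concrete: one arm is reversed (not reverse-complemented) and paired with the other, unchanged arm. The sketch below builds the two sequence pairs joined with `&`, as in the `flipped_lseq_pair` and `flipped_rseq_pair` output columns. The function name is hypothetical, and the MFE calculation itself is left out.

```python
from typing import Dict


def flipped_arm_pairs(lseq: str, rseq: str) -> Dict[str, str]:
    """Build the two flipped-arm control pairs for one interaction."""
    return {
        # Reverse the left arm only and pair it with the original right arm.
        "flipped_lseq_pair": f"{lseq[::-1]}&{rseq}",
        # Reverse the right arm only and pair it with the original left arm.
        "flipped_rseq_pair": f"{lseq}&{rseq[::-1]}",
    }


if __name__ == "__main__":
    print(flipped_arm_pairs("ACGU", "GGCA"))
```

Reversing a sequence keeps its base composition but changes which positions can pair, so it gives a simple per-interaction control alongside the shuffled sequences.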

Input File Format

Input files should be tab-delimited with the following required columns:

lchr	ll	lr	lstrand	rchr	rl	rr	rstrand	name

Optional columns:

mapq
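
For orientation, here is roughly how one input row could map onto two BED records, one per arm, with the strand in column 6 so that a strand-aware extraction can use it. This is a sketch under assumptions: it takes `ll`/`lr` and `rl`/`rr` to be the start and end of the left and right arms, uses `0` as a placeholder score, and the function name `row_to_bed` is hypothetical. The pipeline's own BED preparation may differ in detail.

```python
from typing import Dict, Tuple

REQUIRED_COLUMNS = ["lchr", "ll", "lr", "lstrand", "rchr", "rl", "rr", "rstrand", "name"]


def row_to_bed(row: Dict[str, str]) -> Tuple[str, str]:
    """Return (left_bed_line, right_bed_line) in BED6 layout for one interaction."""
    missing = [c for c in REQUIRED_COLUMNS if c not in row]
    if missing:
        raise ValueError(f"row is missing columns: {missing}")
    # BED6 order: chrom, start, end, name, score, strand.
    left = "\t".join([row["lchr"], row["ll"], row["lr"], row["name"], "0", row["lstrand"]])
    right = "\t".join([row["rchr"], row["rl"], row["rr"], row["name"], "0", row["rstrand"]])
    return left, right


if __name__ == "__main__":
    values = ["chr1", "100", "120", "+", "chr2", "500", "530", "-", "int_1"]
    left, right = row_to_bed(dict(zip(REQUIRED_COLUMNS, values)))
    print(left)
    print(right)
```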

Output

For each sample, the pipeline produces:

{sample_id}_mfe.tsv: Final table with added columns

Output columns include all original columns plus:

lseq: Extracted sequence for left coordinate (strand-aware)
rseq: Extracted sequence for right coordinate (strand-aware)
mfe: Observed minimum free energy (MFE) of the lseq/rseq duplex, in kcal/mol
dot_bracket: Dot-bracket representation of the predicted duplex structure for the observed lseq and rseq
mean_shuffled_mfe: Mean duplex MFE across all successfully evaluated shuffled sequence pairs
sd_shuffled_mfe: Standard deviation of duplex MFE across all successfully evaluated shuffled sequence pairs
delta_mfe: Observed MFE minus the mean shuffled MFE
zscore_mfe: Standardized difference between the observed MFE and the shuffled MFE distribution
empirical_p_lower: Empirical one-sided p-value for observing an MFE this low or lower relative to the shuffled null distribution
n_shuffles_ok: Number of shuffled sequence pairs for which duplex MFE was successfully computed
mfe_lseq_flipped: Duplex MFE after reversing lseq only and folding it against the original rseq
flipped_lseq_dot_bracket: Dot-bracket structure for the duplex formed by reversed lseq and original rseq
flipped_lseq_pair: The exact sequence pair used for that fold, formatted as reversed_lseq&original_rseq
mfe_rseq_flipped: Duplex MFE after reversing rseq only and folding it against the original lseq
flipped_rseq_dot_bracket: Dot-bracket structure for the duplex formed by original lseq and reversed rseq
flipped_rseq_pair: The exact sequence pair used for that fold, formatted as original_lseq&reversed_rseq
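
The shuffled-control columns are summary statistics of the observed MFE against the shuffled MFE values. The sketch below shows one way they could be computed. Two choices here are assumptions rather than documented behaviour: the sample standard deviation (n - 1 denominator) is used, and the empirical p-value adds a pseudocount of one to numerator and denominator. The function name is hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Dict, List


def summarise_shuffled_mfe(observed: float, shuffled: List[float]) -> Dict[str, float]:
    """Compare an observed MFE with the MFEs of shuffled sequence pairs."""
    n = len(shuffled)
    if n == 0:
        return {"mean_shuffled_mfe": math.nan, "sd_shuffled_mfe": math.nan,
                "delta_mfe": math.nan, "zscore_mfe": math.nan,
                "empirical_p_lower": math.nan, "n_shuffles_ok": 0}
    mu = mean(shuffled)
    # Assumption: sample standard deviation; undefined for a single value.
    sd = stdev(shuffled) if n > 1 else math.nan
    delta = observed - mu
    z = delta / sd if n > 1 and sd > 0 else math.nan
    # Assumption: pseudocount so that the p-value is never exactly zero.
    p_lower = (sum(1 for s in shuffled if s <= observed) + 1) / (n + 1)
    return {"mean_shuffled_mfe": mu, "sd_shuffled_mfe": sd,
            "delta_mfe": delta, "zscore_mfe": z,
            "empirical_p_lower": p_lower, "n_shuffles_ok": n}


if __name__ == "__main__":
    print(summarise_shuffled_mfe(-10.0, [-4.0, -6.0, -5.0, -5.0]))
```

A negative `delta_mfe` or `zscore_mfe` means the observed duplex is more stable than the shuffled controls on average.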

Execution Profiles

Docker

nextflow run main.nf -profile docker --input ... --fasta ...

Singularity

nextflow run main.nf -profile singularity --input ... --fasta ...

Pipeline Overview

Input Tables
    ↓
SPLIT_TABLE (--chunk_size rows per chunk, default 10000)
    ↓
PREPARE_BED (convert to BED format)
    ↓
EXTRACT_SEQUENCES (bedtools getfasta -s)
    ↓
ADD_SEQUENCES (add lseq, rseq columns)
    ↓
CALCULATE_MFE or CALCULATE_MFE_CONTROLS (MFE-related columns)
    ↓
CONCATENATE_TABLES (merge chunks)
    ↓
PLOT_MFE_SUMMARY (per sample)
    ↓
Final Output
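
As an illustration of the ADD_SEQUENCES step in the overview above, the sketch below appends `lseq` and `rseq` columns to a chunk from two FASTA texts. It assumes the FASTA records come back in the same order as the table rows, which is an assumption of this example rather than a documented guarantee, and the function names are hypothetical.

```python
from typing import List


def parse_fasta(text: str) -> List[str]:
    """Return the sequences of a FASTA text in record order."""
    seqs: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seqs.append("")       # start a new record
        elif seqs:
            seqs[-1] += line      # sequences may span several lines
    return seqs


def add_sequences(table: List[str], left_fasta: str, right_fasta: str) -> List[str]:
    """Append lseq and rseq columns to a header-first list of table lines."""
    header, rows = table[0], table[1:]
    lseqs, rseqs = parse_fasta(left_fasta), parse_fasta(right_fasta)
    if not (len(rows) == len(lseqs) == len(rseqs)):
        raise ValueError("row count and FASTA record counts differ")
    out = [header + "\tlseq\trseq"]
    out += [f"{row}\t{l}\t{r}" for row, l, r in zip(rows, lseqs, rseqs)]
    return out


if __name__ == "__main__":
    chunk = ["name", "int_1", "int_2"]
    left = ">a\nACGT\n>b\nGG\nCC\n"
    right = ">a\nTTTT\n>b\nAAAA\n"
    print("\n".join(add_sequences(chunk, left, right)))
```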

Example

# Using samplesheet
nextflow run main.nf \
  --input samplesheet.tsv \
  --fasta hg38.fa \
  --fai hg38.fa.fai \
  --chunk_size 10000 \
  --outdir results \
  -profile docker

Notes to self

Settings for local testing

conda activate nfcore_tools_34

nextflow run main.nf \
  --fasta /Volumes/lab-ulej/home/users/luscomben/users/iosubi/projects/structurome_blencowe/ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --input ./data/samplesheet.tsv \
  --outdir results \
  -profile docker \
  --chunk_size 10 \
  -resume \
  --fai /Volumes/lab-ulej/home/users/luscomben/users/iosubi/projects/structurome_blencowe/ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai \
  --shuffled_mfe \
  --n_shuffles 5 \
  --flipped_arm_mfe

Data flow with chunk matching

SPLIT_TABLE → [meta, chunk_0001.txt]
              [meta, chunk_0002.txt]
              [meta, chunk_0003.txt]
                     ↓
PREPARE_BED → [meta, chunk_0001.txt, chunk_0001_left.bed, chunk_0001_right.bed]
              [meta, chunk_0002.txt, chunk_0002_left.bed, chunk_0002_right.bed]
              [meta, chunk_0003.txt, chunk_0003_left.bed, chunk_0003_right.bed]
                     ↓
EXTRACT_SEQUENCES → [meta, chunk_0001.txt, left.fa, right.fa]
                    [meta, chunk_0002.txt, left.fa, right.fa]
                    [meta, chunk_0003.txt, left.fa, right.fa]
                     ↓
ADD_SEQUENCES → [meta, chunk_0001_sequences.tsv]
                [meta, chunk_0002_sequences.tsv]
                [meta, chunk_0003_sequences.tsv]

etc.
