- Splitting large tables into chunks of 10,000 rows by default (preserving headers)
- Extracting FASTA sequences for left and right coordinates using bedtools
- Adding extracted sequences as new columns (lseq, rseq)
- Performing MFE calculation on the extracted sequences (RNAduplex)
- Optionally generating control sequences for MFE comparisons: shuffled sequences (uShuffle) and/or a flipped-arm control (reversing one arm's sequence)
- Concatenating all chunks back into a single table with one header, then plotting a per-sample MFE summary
- Nextflow >= 25.10.2
- Docker or Singularity
Important: we assume BED-style 0-based half-open coordinates are given in the input.
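For readers less familiar with this convention, the following minimal Python sketch illustrates it. The function names are hypothetical and are not part of the pipeline.

```python
from __future__ import annotations


def bed_length(start: int, end: int) -> int:
    """Length of a BED-style interval: start is 0-based inclusive, end is exclusive."""
    if start < 0 or end < start:
        raise ValueError(f"invalid BED interval: [{start}, {end})")
    return end - start


def one_based_to_bed(start_1based: int, end_1based: int) -> tuple[int, int]:
    """Convert a 1-based, fully closed interval (GFF/SAM style) to BED coordinates."""
    # Only the start shifts: the inclusive 1-based end equals the exclusive 0-based end.
    return start_1based - 1, end_1based


def slice_bed(sequence: str, start: int, end: int) -> str:
    """Python slicing uses the same half-open convention as BED."""
    return sequence[start:end]
```

If the input tables were produced with 1-based inclusive coordinates, something like `one_based_to_bed` would have to be applied before running the pipeline; otherwise each extracted sequence would start one base too late.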
nextflow run main.nf \
--input samplesheet.tsv \
--fasta /path/to/genome.fa \
--fai /path/to/genome.fa.fai \
--chunk_size <n> \
--shuffled_mfe \
--n_shuffles <n> \
--flipped_arm_mfe \
--outdir results

Samplesheet format (TSV with header):
sample_id file_path
GM12878_rep1 /path/to/GM12878_rep1.chim.txt
GM12878_rep2 /path/to/GM12878_rep2.chim.txt
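A samplesheet in this layout can be checked before launching a run. The sketch below is only an illustration in Python; `read_samplesheet` is a hypothetical helper, and it assumes the two columns are separated by a single tab.

```python
import csv
import io


def read_samplesheet(handle):
    """Parse a TSV samplesheet and return (sample_id, file_path) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = ["sample_id", "file_path"]
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing column(s): {missing}")
    return [(row["sample_id"], row["file_path"]) for row in reader]


# Example input matching the samplesheet shown above.
example = io.StringIO(
    "sample_id\tfile_path\n"
    "GM12878_rep1\t/path/to/GM12878_rep1.chim.txt\n"
    "GM12878_rep2\t/path/to/GM12878_rep2.chim.txt\n"
)
samples = read_samplesheet(example)
```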
nextflow run main.nf \
--input_pattern 'data/*.chim.txt' \
--fasta /path/to/genome.fa \
--outdir results \
<...>

- --fasta: Path to reference genome FASTA file
- --fai: Path to reference genome FASTA index

- --input: Path to samplesheet (TSV format)
- --input_pattern: Glob pattern for input files (e.g., "data/*.txt")

- --outdir: Output directory (default: './results')
- --chunk_size: Number of rows per chunk (default: 10000)
- --shuffled_mfe: Enable MFE calculations for shuffled control sequences (default: false)
- --n_shuffles: Number of times the sequence is shuffled (default: 100)
- --klet_shuffles: klet for sequence shuffling (default: 2)
- --flipped_arm_mfe: Enable MFE calculations for flipped arm control sequences
Input files should be tab-delimited with the following required columns:
lchr ll lr lstrand rchr rl rr rstrand name
Optional columns:
mapq
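The column requirements above can be verified with a short pre-flight check. This is a sketch, not the pipeline's own validation: the helper names are hypothetical, and it assumes that ll/lr and rl/rr are the start/end of the left and right arms and that strands are written as + or -.

```python
from __future__ import annotations

REQUIRED = ["lchr", "ll", "lr", "lstrand", "rchr", "rl", "rr", "rstrand", "name"]


def missing_columns(header_line: str) -> list[str]:
    """Return the required column names absent from a tab-delimited header line."""
    columns = header_line.rstrip("\n").split("\t")
    return [c for c in REQUIRED if c not in columns]


def row_problems(row: dict) -> list[str]:
    """Return human-readable problems found in one data row (empty list if none)."""
    problems = []
    for start_key, end_key in (("ll", "lr"), ("rl", "rr")):
        try:
            start, end = int(row[start_key]), int(row[end_key])
        except ValueError:
            problems.append(f"{start_key}/{end_key} are not integers")
            continue
        # Assumption: 0-based half-open intervals, so end must exceed start.
        if start < 0 or end <= start:
            problems.append(f"{start_key}/{end_key} is not a valid half-open interval")
    for strand_key in ("lstrand", "rstrand"):
        if row[strand_key] not in ("+", "-"):
            problems.append(f"{strand_key} must be + or -")
    return problems
```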
For each sample, the pipeline produces:
{sample_id}_mfe.tsv: Final table with added columns
Output columns include all original columns plus:
- lseq: Extracted sequence for left coordinate (strand-aware)
- rseq: Extracted sequence for right coordinate (strand-aware)
- mfe: Observed minimum free energy, MFE (kcal/mol)
- dot_bracket: Dot-bracket representation of the predicted duplex structure for the observed lseq and rseq
- mean_shuffled_mfe: Mean duplex MFE across all successfully evaluated shuffled sequence pairs
- sd_shuffled_mfe: Standard deviation of duplex MFE across all successfully evaluated shuffled sequence pairs
- delta_mfe: Observed MFE minus the mean shuffled MFE
- zscore_mfe: Standardized difference between the observed MFE and the shuffled MFE distribution
- empirical_p_lower: Empirical one-sided p-value for observing an MFE this low or lower relative to the shuffled null distribution
- n_shuffles_ok: Number of shuffled sequence pairs for which duplex MFE was successfully computed
- mfe_lseq_flipped: Duplex MFE after reversing lseq only and folding it against the original rseq
- flipped_lseq_dot_bracket: Dot-bracket structure for the duplex formed by reversed lseq and original rseq
- flipped_lseq_pair: The exact sequence pair used for that fold, formatted as reversed_lseq&original_rseq
- mfe_rseq_flipped: Duplex MFE after reversing rseq only and folding it against the original lseq
- flipped_rseq_dot_bracket: Dot-bracket structure for the duplex formed by original lseq and reversed rseq
- flipped_rseq_pair: The exact sequence pair used for that fold, formatted as original_lseq&reversed_rseq
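To make the control columns above concrete, here is a minimal Python sketch of how such values could be derived from an observed MFE and a list of shuffled MFEs. It is an illustration only: the function names are hypothetical, and the use of the sample standard deviation and of a +1 pseudo-count in the empirical p-value are assumptions, not confirmed details of the pipeline. The flipped-arm helper reverses a sequence (it does not reverse-complement it), following the column descriptions.

```python
import statistics


def summarize_shuffles(observed_mfe, shuffled_mfes):
    """Summarize an observed MFE against shuffled-control MFEs (None = failed fold)."""
    values = [m for m in shuffled_mfes if m is not None]  # successful folds only
    n = len(values)
    nan = float("nan")
    mean = statistics.fmean(values) if n else nan
    sd = statistics.stdev(values) if n > 1 else nan  # assumption: sample SD
    delta = observed_mfe - mean
    zscore = delta / sd if n > 1 and sd > 0 else nan
    # Count shuffles at least as low (as stable) as the observed duplex.
    n_as_low = sum(1 for m in values if m <= observed_mfe)
    return {
        "mean_shuffled_mfe": mean,
        "sd_shuffled_mfe": sd,
        "delta_mfe": delta,
        "zscore_mfe": zscore,
        "empirical_p_lower": (n_as_low + 1) / (n + 1),  # assumption: +1 pseudo-count
        "n_shuffles_ok": n,
    }


def flipped_pairs(lseq, rseq):
    """Build the two flipped-arm sequence pairs, joined with '&'."""
    return {
        "flipped_lseq_pair": f"{lseq[::-1]}&{rseq}",
        "flipped_rseq_pair": f"{lseq}&{rseq[::-1]}",
    }
```

With the pseudo-count, the smallest attainable p-value is 1 / (n_shuffles_ok + 1), so a small --n_shuffles value limits how small empirical_p_lower can get.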
nextflow run main.nf -profile docker --input ... --fasta ...
nextflow run main.nf -profile singularity --input ... --fasta ...

Input Tables
↓
SPLIT_TABLE (10k rows per chunk)
↓
PREPARE_BED (convert to BED format)
↓
EXTRACT_SEQUENCES (bedtools getfasta -s)
↓
ADD_SEQUENCES (add lseq, rseq columns)
↓
CALCULATE_MFE or CALCULATE_MFE_CONTROLS (MFE-related columns)
↓
CONCATENATE_TABLES (merge chunks)
↓
PLOT_MFE_SUMMARY (per sample)
↓
Final Output
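The first and last data-handling steps of this flow (splitting with the header repeated in every chunk, then merging back with a single header) can be sketched in Python as follows. The function names are hypothetical; the actual SPLIT_TABLE and CONCATENATE_TABLES processes may be implemented differently.

```python
def split_table(lines, chunk_size=10000):
    """Split table lines (header first) into chunks that each repeat the header."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    # A header-only table yields no chunks in this sketch.
    return [[header] + rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def concatenate_tables(chunks):
    """Merge chunks back into one table, keeping only the first chunk's header."""
    merged = []
    for index, chunk in enumerate(chunks):
        merged.extend(chunk if index == 0 else chunk[1:])
    return merged
```

Because every chunk carries its own header, each one can be processed independently and in parallel, and the merge only has to drop the repeated header lines.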
# Using samplesheet
nextflow run main.nf \
--input samplesheet.tsv \
--fasta hg38.fa \
--chunk_size 10000 \
--outdir results \
-profile docker

conda activate nfcore_tools_34
nextflow run main.nf \
--fasta /Volumes/lab-ulej/home/users/luscomben/users/iosubi/projects/structurome_blencowe/ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--input ./data/samplesheet.tsv \
--outdir results \
-profile docker \
--chunk_size 10 \
-resume \
--fai /Volumes/lab-ulej/home/users/luscomben/users/iosubi/projects/structurome_blencowe/ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai \
--shuffled_mfe \
--n_shuffles 5 \
--flipped_arm_mfe
SPLIT_TABLE → [meta, chunk_0001.txt]
              [meta, chunk_0002.txt]
              [meta, chunk_0003.txt]
↓
PREPARE_BED → [meta, chunk_0001.txt, chunk_0001_left.bed, chunk_0001_right.bed]
              [meta, chunk_0002.txt, chunk_0002_left.bed, chunk_0002_right.bed]
              [meta, chunk_0003.txt, chunk_0003_left.bed, chunk_0003_right.bed]
↓
EXTRACT_SEQUENCES → [meta, chunk_0001.txt, left.fa, right.fa]
                    [meta, chunk_0002.txt, left.fa, right.fa]
                    [meta, chunk_0003.txt, left.fa, right.fa]
↓
ADD_SEQUENCES → [meta, chunk_0001_sequences.tsv]
                [meta, chunk_0002_sequences.tsv]
                [meta, chunk_0003_sequences.tsv]
etc.
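As an illustration of what the left and right BED files in the channel structure above might contain, the Python sketch below turns one table row into two BED6 lines. This is an assumption about the layout, not the pipeline's confirmed output: `row_to_bed6` is a hypothetical helper, and the score column is filled with a placeholder 0. The strand is placed in column 6 because that is where bedtools getfasta -s reads it.

```python
def row_to_bed6(row):
    """Return (left_line, right_line) as tab-separated BED6 strings for one row."""
    def bed6(chrom, start, end, strand):
        # BED6 columns: chrom, start, end, name, score, strand.
        return "\t".join([chrom, str(start), str(end), row["name"], "0", strand])

    left = bed6(row["lchr"], row["ll"], row["lr"], row["lstrand"])
    right = bed6(row["rchr"], row["rl"], row["rr"], row["rstrand"])
    return left, right
```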