🧬 Pipeline of DNA- and RNA-seq for eQTL/sQTL Mapping and Loci Prioritization

Overview

This repository contains a Nextflow DSL2 pipeline for performing DNA QC, RNA QC, kinship analysis, and data reformatting for use in TensorQTL QTL mapping.

The pipeline includes:

Extracting, indexing, and filtering VCF files
PCA analyses for both genotypes and RNA expression
eQTL and sQTL analyses
Colocalization (coloc) analyses
Plotting and downstream interpretation

Pipeline Overview

---
config:
  layout: dagre
---
flowchart TD
    A["<b>TOPmed.freeze10b.vcf.gz</b>"] --> B["<b>Extract TOPCHef Samples</b>"]
    B -- "MAF > 0.01" --> D["<b>Kinship Analyses &amp; Remove Related Samples</b>"]
    B --> G["<b>LD Prune/Thin/Filter VCF</b>"]
    G --> J["<b>Genome-wide SNP PCA</b>"]
    AA["<b>gene_reads.gct.gz</b>"] --> K["<b>MedRatio Normalize RNA &amp; PCA</b>"] & M["<b>TMM Norm RNA &amp; PCA</b>"]
    K -- Remove RNA outliers --> N["<b>Reformat gene/splicing matrix &amp; covariates</b>"]
    M -- "Remove low-expression Genes & PCA" --> N
    N --> nn["<b>Run cis-QTL Saturation Test</b>"]
    J --> nn
    nn -- Identify best model for maximizing significant (e/s)Genes --> jj["<b>Run nominal cis-QTL for best covariate model</b>"]
    jj --> P["<b>Run Coloc</b>"] & n1["<b>Run trans-QTL for best covariate model</b>"]
    O["<b>Standardize &amp; LiftOver GWAS</b>"] --> P
    P --> Q["<b>Analyze Coloc &amp; Output Candidate Genes</b>"]
    QQQ["<b>Run RegTools</b>"] --> n2["<b>Run LeafCutter</b>"]
    QQ["<b>RNA.bam(s)</b>"] --> QQQ
    n2 --> N
    A@{ shape: card}
    B@{ shape: rounded}
    D@{ shape: rounded}
    G@{ shape: rounded}
    J@{ shape: rounded}
    AA@{ shape: card}
    K@{ shape: rounded}
    M@{ shape: rounded}
    N@{ shape: rounded}
    nn@{ shape: procs}
    jj@{ shape: rounded}
    P@{ shape: rounded}
    n1@{ shape: rounded}
    O@{ shape: rounded}
    Q@{ shape: rounded}
    QQQ@{ shape: rounded}
    n2@{ shape: rounded}
    QQ@{ shape: procs}
     A:::Rose
     B:::Rose
     D:::Rose
     G:::Rose
     J:::Rose
     K:::Sky
     M:::Sky
     N:::Sky
     jj:::Peach
     P:::Peach
     n1:::Peach
     O:::Peach
     Q:::Peach
     QQQ:::Aqua
     n2:::Aqua
     QQ:::Aqua
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    classDef Sky stroke-width:1px, stroke-dasharray:none, stroke:#374D7C, fill:#E2EBFF, color:#374D7C
    classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
    classDef Pine stroke-width:1px, stroke-dasharray:none, stroke:#254336, fill:#27654A, color:#FFFFFF
    classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
    style A color:#000000,fill:#FFCDD2
    style B fill:#FFCDD2,color:#000000
    style D color:#000000
    style G color:#000000
    style J color:#000000
    style AA fill:#BBDEFB,color:#000000
    style K color:#000000
    style M color:#000000
    style N color:#000000
    style nn color:#000000,fill:#E1BEE7
    style jj color:#000000
    style P color:#000000
    style n1 color:#000000
    style O color:#000000
    style Q color:#000000
    style QQQ color:#000000
    style n2 color:#000000
    style QQ color:#000000

Installation & Requirements

You must have Nextflow and Apptainer installed to run this pipeline.

Tested versions:

nextflow/25.04.6
apptainer/1.3.4

Prerequisite Files

1. Metadata - sample and participant information

File	Description
topchef_samples_bcf.txt	List of sample names matching the genomic BCF/VCF (e.g., `NWD100`)
sample_participant_map	Two-column, tab-delimited file mapping RNA to DNA IDs (e.g., `TOR100 NWD100`)
metadata_10_17_2024_CSM.txt	Multi-column table of general experimental metadata
chromosomes	List of human chromosomes (no header, e.g., `chr1`)

2. Create simplified GTF for various coloc analysis/plotting scripts

Download the base GTF from: https://www.gencodegenes.org/human/release_34.html
The nextflow pipeline will generate "collapsed.gtf" and output to "output/genome_files"

# By Connor Murray
# Make streamlined GTF use the following tidyverse image: 
# singularity pull library://connmurr243/wgs/topchef_tidyverse_r.sif

# Libraries
library(tidyverse)
library(data.table)

# Working directory 
setwd("/standard/vol185/cphg_Manichaikul/users/csm6hg/nextflow_dna")

# Read in GTF
gene.gtf_m <- data.table(fread("output/genome_files/collapsed.gtf"))
gene.gtf <- gene.gtf_m[V3=="gene"]                         

# Process the data
gene.gtf[, gene := sub("gene_id*. ", "", V9)]
gene.gtf[, gene := sub(";.*", "", gene)]  # Remove everything after the gene name
gene.gtf[, gene := substr(gene, 2, nchar(gene) - 1)]
gene.gtf[, gene_edit := sub("\\..*", "", gene)]
gene.gtf[, common_gene := sub(".*gene_name", "", V9)]
gene.gtf[, common_gene := sub(";.*", "", common_gene)]
gene.gtf[, common_gene := substr(common_gene, 3, nchar(common_gene) - 1)]
gene.gtf[, len := V5 - V4]
colnames(gene.gtf)[1:5] <- c("chrom", "file", "sec", "start", "stop")

#gene.gtf[, transcript := sub(".*transcript_id*. ", "", V9)]
#gene.gtf[, transcript := sub(";.*", "", transcript)]  # Remove everything after the gene name
#gene.gtf[, transcript := substr(transcript, 2, nchar(transcript) - 1)]

# Output 
saveRDS(gene.gtf %>% select(-c(V6,V8,sec)), 
        "gencode.v34.GRCh38.ERCC.genes.collapsed.streamlined.RDS")

Steps 3-5 require a bit of set up if you are using alternative GWAS statistics that require LiftOver to HG38

3. [Optional] Liftover Files, only necessary if you perform liftover (HG19 --> HG38) on the GWAS statistics

Download the chain file for liftover:

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz

Edit its location in lines 74–75 of the script: standardize_GWAS_eQTL_for_coloc_new.R

4. [Optional] Run CrossMap for Specific Genome Version (HG38 Gencode v34)

# Run crossmap (from: https://github.com/porchard/crossmap-nextflow/tree/master)
nextflow run --results ./output --mismatch 2 --exon_k 75 --utr_k 36 \
  --fasta hg38.fa \
  --gtf gencode.v34.GRCh38.annotation.gtf \
  main.nf \
  -profile slurm -resume -bg

Make a negative mappability mask (to mask low-mappability sites during QTL mapping):

module load htslib

zcat /crossmap-nextflow/output/snp_mappability/snp_mappability_36mer_2mismatch.bed.gz \
  | awk '$4 == 1' \
  | bgzip > snp_mappability_36mer_2mismatch_mapp1.bed.gz

Alternatively, you can download precomputed mappability files from: https://figshare.com/collections/Mappability_Resources/4297352/2

5. [Optional] Colocalization with a Different GWAS

The test setup uses the DCM Jurgens et al., (2024) GWAS: https://api.kpndataregistry.org/api/d/CQyqth, from: https://cvd.hugeamp.org/downloads.html.

Ensure your GWAS summary files:

Contain full nominal statistics (not FDR-adjusted)
Have the following column names, change manually (if necessary!):

Field	Column Name	Example
Base-pair location	`BP`	119394
Chromosome	`CHR`	1
Effect allele frequency	`af`	0.42
p-value	`p_value`	1.2e-8
Standard error	`standard_error`	0.05
Effect size	`beta`	0.21
Effect allele	`A1`	A
Reference allele	`A2`	G
Variant ID	`SNP`	rs12345

Updating/personalizing the `nextflow.config`

This pipeline requires users to update file paths and cluster settings inside the nextflow.config file before running. The configuration file controls:

SLURM resources (CPUs, memory, partition, account)
Directory structure for outputs and scripts
Location of RNA-seq counts, metadata, genome files, and optional GWAS inputs
Paths to all auxiliary reference files

Use the guide below to update each section.

General SLURM & Resource Settings

Modify these fields to match your HPC cluster:

// General parameters
threads = 8 // default number of threads per job
memory = '30 GB'
partition = 'standard'
account = 'manichaikul' // SLURM account for job scheduling

Directory Paths

Update all paths to match your system layout:

// Directories
wd = "/standard/vol185/TOPMed/Freeze_10b/88293/topmed-dcc/exchange/phs001217_TOPMed_WGS_GenSalt/Combined_Study_Data/Genotypes/freeze.10b/phased"
out = "/scratch/csm6hg/nextflow_dna/output"
scripts_dir = "/scratch/csm6hg/nextflow_dna/scripts"

// Raw gene count data - RNA matrix
gene_count_file = "/standard/vol185/TOPMed/TOPCHef/82214/topmed-dcc/exchange/phs002038_TOPMed_TOPCHeF/Omics/RNASeq/release3/TOPMed_Taylor_P4.RNASeQCv2.3.6_gene_reads.gct.gz"

Edit these fields:

wd — directory containing the genomic VCFs
out — where all workflow output will be stored
scripts_dir — directory containing the helper Python/R scripts
gene_count_file — path to the main RNA-seq gene count matrix (GCT or similar)

Metadata Files

// Metadata - sample and participant information
samp = "/scratch/csm6hg/metadata/topchef_samples_bcf.txt"
sample_participant_map = "/scratch/csm6hg/metadata/sample_participant_map"
metadata = "/scratch/csm6hg/metadata/metadata_10_17_2024_CSM.txt"
chrom_file = "/scratch/csm6hg/metadata/chromosomes" // List of human chromosomes, no header

Edit these fields:

samp — list of DNA-level sample names found in the VCF
sample_participant_map — RNA↔DNA ID mapping file
metadata — main project metadata file
chrom_file — list of chromosomes used by the pipeline

Genome Annotation & Mappability Resources

// Script inputs
gtf = "/scratch/csm6hg/genome_files/gencode.v34.GRCh38.annotation.gtf"
gtf_streamlined = "/scratch/csm6hg/genome_files/gencode.v34.GRCh38.ERCC.genes.collapsed.streamlined.RDS"
mapp_file = "/scratch/csm6hg/data/hg38_gene_mappability.txt.gz"
snp_mapp_file = "/scratch/csm6hg/crossmap-nextflow/output/snp-mappability/snp_mappability_36mer_2mismatch_mapp1.bed.gz"

Edit these fields:

gtf — reference genome annotation
gtf_streamlined — collapsed GTF (pipeline-specific)
mapp_file — gene mappability file
snp_mapp_file — SNP mappability BED

Optional: GWAS Inputs for Coloc/Integration

// GWAS inputs - specific to this project and coloc analyses
gwas_levin = "/scratch/csm6hg/data/HF-multiancestry-maf0.01.tsv.gz"
gwas_shah = "/scratch/csm6hg/data/ShahS_31919418_HeartFailure.gz"
gwas_jurgens = "/scratch/csm6hg/data/Jurgens_DCM_GWAS_META_BiobanksOnly.tsv.gz"

Pipeline Usage

Testing of short-runs to see if all process run fine

# Testing a short run for eQTLs (~20 mins)
nextflow run main_process_test.nf -profile slurm -bg

# Testing a short run for sQTLs (~20 mins)
nextflow run modules/splicing/splicing_tesnsorqtl_test.nf -profile slurm -bg

1: Run Main eQTL Pipeline

nextflow run main_process.nf -profile slurm -resume -bg

2: [Optional] Create Coloc-Locus Plots

# Generate gene-by-gene parameter file
Rscript makeCandidateGeneLDParameters.R

If you have a priority gene list:

head candgenes.ldlist.txt

> gene_edit       common_gene     chrom   startLD stopLD  evidence 
  ENSG00000069431 ABCC9   chr12   20797401        22942529        limited 
  ENSG00000159251 ACTC1   chr15   33790230        35795549        high 
  ENSG00000148677 ANKRD1  chr10   89912096        91921276        limited

Run the LD/coloc plotting module:

nextflow run ld.nf -profile slurm -bg

Splicing-QTLs (sQTLs)

1: Run Splicing-QTL Pipeline

nextflow run modules/splicing_modules/splicing_tensorqtl.nf -profile slurm -resume -bg

2: [Optional] Generate Coloc-Locus Plots for sQTLs

library(data.table)
library(tidyverse)

# Read in coloc results and filter candidate sites
coloc <- fread("output/coloc_sqtl/coloc_sqtl_candidates_full.txt") %>% 
  filter(PP.H4 > 0.8) %>%
  select(chrom, minPos, maxPos, gene_raw, gene.x)

# Write output file
write.table(coloc, file="candgenes.spliceldlist_9_16_25.txt", sep="\t", quote=F, row.names=F)

Example output:

> chrom   minPos  maxPos  gene_raw        gene.x
  chr1    727233  1981319 ENSG00000187642 chr1:981173:982065:clu_86_-:ENSG00000187642.9
  chr10   72665997 74661771 ENSG00000166317 chr10:73648879:73649811:clu_9644_-:ENSG00000166317.12

Run the LD plotting step:

nextflow run splicing_modules/splice_ld.nf -profile slurm -bg

Main Scripts and Modules

Nextflow Pipelines

File	Description
`main_process.nf`	Main eQTL/coloc workflow
`modules/mainRNA_flow.nf`	RNA normalization, PCA, and outlier detection
`modules/concatvcf.nf`	Concatenate VCF files
`modules/king.nf`	Run kinship analyses
`modules/SNP_PCA.nf`	Perform SNP PCA and identify outliers
`modules/reformat_eqtl.nf`	Reformat PC covariates and VCF for TensorQTL
`modules/coloc.nf`	Run coloc analysis
`modules/ld.nf`	Generate LD and locus-compare-like plots

Supporting Scripts

Script	Function
`collapse_annotations.py`	Extract longest ORF per transcript
`reformat_TSS_gtf.py`	Reformat GTF for transcription start sites
`medratio_norm_pca.py`	Median ratio normalization and PCA
`tmm_norm_pca_sex.py`	TMM normalization, PCA, and sex inference
`plot_king.R`	Plot kinship results and related individuals
`vcf2gds.R`	Convert VCF → GDS format
`topchef_dna_pca.R`	Generate SNP PCA and ancestry PCs
`reformat_eqtl.R`	Prepare covariates and phenotypes for TensorQTL
`standardize_GWAS_eQTL_for_coloc_new.R`	Standardize GWAS data for coloc
`run_cisNominaleQTL_coloc.R`	Run coloc analysis
`analysis_coloc.R`	Analyze coloc results and generate plots

Configuration

The pipeline can be customized via the nextflow.config file, where you can specify:

Input/output directories
Resource requirements (CPUs, memory, etc.)
Paths to reference data (RNAseq, GWAS, metadata) and chain files

Splicing (LeafCutter) Notes

If using LeafCutter, you must adapt the make_exonList.py script for your GENCODE GTF version. This script outputs a list of exons used by LeafCutter.

import pandas as pd
import qtl.annotation

annot = qtl.annotation.Annotation('gencode.v34.GRCh38.annotation.gtf')
exon_df = pd.DataFrame([[g.chr, e.start_pos, e.end_pos, g.strand, g.id, g.name]
                        for g in annot.genes for e in g.transcripts[0].exons],
                       columns=['chr', 'start', 'end', 'strand', 'gene_id', 'gene_name'])
exon_df.to_csv('gencode.v34.GRCh38.genes.exons.txt.gz', sep='\t', index=False)

QTL Summary Statistics Download

You can download the per-chromosome summary statistics used in the paper from Zenodo: https://doi.org/10.5281/zenodo.17932666
The files are compressed as tar.gz and contain the standard format of QTL summary statistics.

Contributing

Contributions are welcome! I’m still learning Nextflow and open to improvements. Please open an issue or submit a pull request.

License

This project is licensed under the MIT License.

Citation

TBD — you may cite this repository as needed.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
.vscode		.vscode
definition		definition
modules		modules
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
.nextflow.pid		.nextflow.pid
README.md		README.md
chromosomes		chromosomes
list.eqtls		list.eqtls
main_process.nf		main_process.nf
main_process_test.nf		main_process_test.nf
nextflow.config		nextflow.config

connor122721/nf-eqtls

Folders and files

Latest commit

History

Repository files navigation

🧬 Pipeline of DNA- and RNA-seq for eQTL/sQTL Mapping and Loci Prioritization

Overview

Pipeline Overview

Installation & Requirements

Prerequisite Files

1. Metadata - sample and participant information

2. Create simplified GTF for various coloc analysis/plotting scripts

Steps 3-5 require a bit of set up if you are using alternative GWAS statistics that require LiftOver to HG38

3. [Optional] Liftover Files, only necessary if you perform liftover (HG19 --> HG38) on the GWAS statistics

4. [Optional] Run CrossMap for Specific Genome Version (HG38 Gencode v34)

5. [Optional] Colocalization with a Different GWAS

Updating/personalizing the nextflow.config

General SLURM & Resource Settings

Directory Paths

Metadata Files

Genome Annotation & Mappability Resources

Optional: GWAS Inputs for Coloc/Integration

Pipeline Usage

Testing of short-runs to see if all process run fine

1: Run Main eQTL Pipeline

2: [Optional] Create Coloc-Locus Plots

Splicing-QTLs (sQTLs)

1: Run Splicing-QTL Pipeline

2: [Optional] Generate Coloc-Locus Plots for sQTLs

Main Scripts and Modules

Nextflow Pipelines

Supporting Scripts

Configuration

Splicing (LeafCutter) Notes

QTL Summary Statistics Download

Contributing

License

Citation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Updating/personalizing the `nextflow.config`

Packages