wes_chd_ukbb

Meta-analysis of a large-scale whole-exome sequencing (WES) dataset using Hail

This repository contains a series of pipelines (mostly command-line tools) to analyse the CHD case-control exome cohort described in the published study "Assessing the contribution of rare variants to congenital heart disease through a large-scale case-control exome study" (npj Genomic Medicine).

Most pipelines were adapted from the gnomAD repository [1]. Pipelines were built using the Python library Hail (https://hail.is).

Exome dataset: Congenital Heart Disease (CHD) cases were mainly sequenced as part of an initiative from the German Competence Network for Congenital Heart Defects, the Deciphering Developmental Disorders (DDD) project and the University of Nottingham (UK); controls were sequenced as part of the UK Biobank (UKBB).

Setup

Prerequisites

Linux (Ubuntu 20.04+ recommended)
Miniconda or Python 3.9 + pip
Java 11 on PATH (required by Hail/Spark)
A running Spark standalone cluster, or Hail in local mode

Install with conda (recommended)

conda env create -f environment.yml
conda activate wes_chd_ukbb

Install with pip

pip install -r requirements.txt

Note: pyspark is not listed in the requirements because Hail bundles a compatible Spark version. Installing a separate pyspark causes classpath conflicts.

Run a script

Scripts are run as Python modules from the repository root:

python -m sample_qc.apply_hard_filters

Path configuration

Paths are configured via environment variables. Set them before running any pipeline:

export WES_NFS_DIR="file:///home/ubuntu/data"   # root of your local data mount
export WES_HDFS_DIR="hdfs://spark-master:9820"  # HDFS base URL for output/checkpoint dirs

If these variables are not set, the original production defaults (file:///home/ubuntu/data and hdfs://spark-master:9820) are used, preserving backward compatibility.

Both variables are centralised in utils/config.py, which is the only file that needs to change when porting the project to a new host or storage backend.
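As an illustration of the fallback behaviour described above, here is a minimal sketch of how the two variables could be resolved. It is not the actual content of utils/config.py: the function names `get_storage_roots` and `join_path` are hypothetical, and only the variable names and default values come from this README.

```python
import os

# Defaults quoted from this README; used when the variables are unset.
DEFAULT_NFS_DIR = "file:///home/ubuntu/data"
DEFAULT_HDFS_DIR = "hdfs://spark-master:9820"


def get_storage_roots(env=None):
    """Return (nfs_dir, hdfs_dir), falling back to the defaults.

    `env` is any mapping; it defaults to os.environ (hypothetical helper).
    """
    env = os.environ if env is None else env
    nfs_dir = env.get("WES_NFS_DIR", DEFAULT_NFS_DIR)
    hdfs_dir = env.get("WES_HDFS_DIR", DEFAULT_HDFS_DIR)
    return nfs_dir, hdfs_dir


def join_path(root, *parts):
    """Join path components onto a URL-style root (hypothetical helper)."""
    return "/".join([root.rstrip("/"), *[p.strip("/") for p in parts]])


if __name__ == "__main__":
    nfs_dir, hdfs_dir = get_storage_roots()
    print(nfs_dir, hdfs_dir)
```

Passing the environment as an argument keeps the lookup easy to exercise without modifying the real process environment.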

Sample QC

Hard filters: Mark samples with ambiguous chromosomal sex, low call rate and/or low coverage. (sample_qc/apply_hard_filters.py)
Population ancestry inference: Impute sample ancestries using the 1000 Genomes Phase 3 sequence dataset. (sample_qc/ancestry_inference.py)
Sample relatedness inference: Identify twins/duplicated samples as well as first- and second-degree relatives. (sample_qc/relatedness_inference.py)
Platform inference: Assign capture platform to samples using unsupervised clustering. (sample_qc/platform_pca.py)
Platform- and population-specific outlier filtering: Detect sample outliers stratified by population/platform. (sample_qc/sample_qc.py)
Final sample QC: Mark samples failing QC and generate a Hail MatrixTable with high-quality genotypes (depth of coverage >= 10, genotype quality >= 20 and allele balance of heterozygous genotypes > 0.20). (sample_qc/finalise_sample_qc.py)
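The genotype thresholds in the final sample QC step can be written as a simple predicate. The sketch below is plain Python for illustration only; the repository applies its filters through Hail, and the function name, the argument names and the definition of allele balance as alternate reads divided by total depth are assumptions.

```python
def is_high_quality_genotype(dp, gq, is_het, alt_reads=0):
    """Return True if a genotype passes the README thresholds.

    dp        -- depth of coverage at the site
    gq        -- genotype quality
    is_het    -- True for heterozygous calls
    alt_reads -- reads supporting the alternate allele (assumed definition
                 of allele balance: alt_reads / dp)
    """
    if dp < 10 or gq < 20:
        return False
    if is_het:
        # dp >= 10 here, so the division is safe.
        return alt_reads / dp > 0.20
    return True


if __name__ == "__main__":
    print(is_high_quality_genotype(dp=30, gq=99, is_het=True, alt_reads=15))
    print(is_high_quality_genotype(dp=30, gq=99, is_het=True, alt_reads=3))
```

The allele-balance check applies only to heterozygous calls, as stated in the step description; homozygous calls are judged on depth and genotype quality alone.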

Variant QC

Hard filters: Mark a variant as failing hard filters if it a) shows an excess of heterozygotes (inbreeding coefficient < -0.3) and b) has no sample with a high-quality genotype.
RF model: Application of a random forest (RF) model to distinguish true variants from potential false positives. (variant_qc/3.train_apply_finalise_RF.py)
VQSR filter: Application of the GATK Variant Quality Score Recalibration (VQSR) tool.
Coverage: Mark a variant as covered if it a) falls within the intervals of the major capture platforms used in the assembled cohort and b) shows a coverage of 10X or more in at least 90% of the samples in the gnomAD genome dataset (version 3.1.0).
Final variant QC: Mark variants failing hard, random forest, VQSR and/or coverage filters. (variant_qc/finalise_variant_qc.py)
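To make the excess-heterozygosity criterion in the hard filters concrete, the sketch below computes an inbreeding coefficient for a biallelic site from genotype counts and compares it with the -0.3 threshold. The formula F = 1 - observed heterozygotes / expected heterozygotes under Hardy-Weinberg is a common definition and is assumed here; the function names are hypothetical, and the repository's own Hail implementation may differ in detail.

```python
def inbreeding_coefficient(n_hom_ref, n_het, n_hom_var):
    """Return F = 1 - observed_het / expected_het, or None if undefined."""
    n = n_hom_ref + n_het + n_hom_var
    if n == 0:
        return None
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p                              # alternate allele frequency
    expected_het = 2 * p * q * n           # Hardy-Weinberg expectation
    if expected_het == 0:
        return None                        # monomorphic site
    return 1 - n_het / expected_het


def fails_excess_het(n_hom_ref, n_het, n_hom_var, threshold=-0.3):
    """True when the site shows an excess of heterozygotes (F < threshold)."""
    f = inbreeding_coefficient(n_hom_ref, n_het, n_hom_var)
    return f is not None and f < threshold


if __name__ == "__main__":
    print(inbreeding_coefficient(25, 50, 25))  # site in Hardy-Weinberg proportions
    print(fails_excess_het(0, 100, 0))         # every sample heterozygous
```

A strongly negative coefficient means more heterozygotes than expected, which often points to a mapping or calling artefact rather than true variation.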

Burden testing

Gene burden test: Run a gene-based case-control burden test (Fisher's exact test) stratified by variant functional category and proband syndromic status. (pipelines/gene_burden_fet.py)
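For readers unfamiliar with the test, the following is a minimal, standard-library sketch of a two-sided Fisher's exact test on a single 2x2 table of carriers versus non-carriers in cases and controls. It only illustrates the statistic for one gene and one stratum; the function name and argument layout are hypothetical and this is not the code in pipelines/gene_burden_fet.py.

```python
from math import comb


def fisher_exact_two_sided(case_carriers, case_noncarriers,
                           control_carriers, control_noncarriers):
    """Two-sided Fisher's exact p-value for a 2x2 carrier table."""
    a, b = case_carriers, case_noncarriers
    c, d = control_carriers, control_noncarriers
    n_cases = a + b
    n_carriers = a + c
    n = a + b + c + d
    denom = comb(n, n_cases)

    def prob(x):
        # Hypergeometric probability of x carriers among the cases,
        # with all table margins held fixed.
        return comb(n_carriers, x) * comb(n - n_carriers, n_cases - x) / denom

    lo = max(0, n_cases - (n - n_carriers))
    hi = min(n_cases, n_carriers)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the
    # observed one; the small tolerance guards against float ties.
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p_value)


if __name__ == "__main__":
    print(fisher_exact_two_sided(3, 1, 1, 3))
```

In a stratified analysis, one such table would be built per gene for each combination of variant functional category and syndromic status.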

Gene-set burden test: Run a gene set-based case-control burden test (logistic regression) stratified by variant functional category and proband syndromic status. (pipelines/geneset_burden_logreg.py)
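As a rough illustration of the regression behind a gene-set burden test, the sketch below fits case status against a single burden variable (for example, carrier status within the gene set) using Newton-Raphson. It assumes one predictor plus an intercept and no covariates, which is a simplification; the function name is hypothetical and the actual pipeline in pipelines/geneset_burden_logreg.py may include additional terms.

```python
import math


def fit_logistic(x, y, n_iter=25):
    """Fit y ~ intercept + x by Newton-Raphson; return (b0, b1).

    x -- burden values per sample (e.g. 0/1 carrier status)
    y -- case status per sample (1 = case, 0 = control)
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0:
            raise ValueError("singular information matrix")
        # Newton step: beta += inverse(information) * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


if __name__ == "__main__":
    # Toy data: 10 of 50 cases and 5 of 100 controls carry a variant.
    x = [1] * 10 + [0] * 40 + [1] * 5 + [0] * 95
    y = [1] * 50 + [0] * 100
    b0, b1 = fit_logistic(x, y)
    print(b0, b1, math.exp(b1))  # exp(b1) is the estimated odds ratio
```

With a single binary predictor the fitted slope equals the log odds ratio of the 2x2 table, which gives a simple way to sanity-check the fit.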

Util scripts

VEP parser: Parse a VCF file annotated with the Variant Effect Predictor (VEP) tool. (pipelines/vep_parser.py)

dbNSFP parser: Generate a Hail Table from dbNSFP (version 4.1a) for annotation and downstream analysis. (script-utils/parse_dbnsfp_variants.py)

References

[1] Karczewski, KJ et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

Citation

If you use this repository, please cite:

Audain, E., Wilsdon, A. et al. Assessing the contribution of rare variants to congenital heart disease through a large-scale case-control exome study. npj Genom. Med. 11, 30 (2026). https://doi.org/10.1038/s41525-026-00582-z
