comp-med/seer-pqtl-elgh

Nanoparticle-enriched proteomics in British South Asian individuals identifies mechanistic links between genetic variants, plasma protein levels and disease risk

This repository contains the analysis code accompanying the paper published in Nature Genetics (2026).

Pietzner M, Williamson A, Hunt KA, Koprulu M, Kohleick L, Demircan K, Genes & Health Research Team, Finer S, Carrasco Zanini J, van Heel DA, Langenberg C. Nanoparticle-enriched mass spectrometry proteomics in British South Asian individuals identifies mechanistic links between genetic variants, plasma protein levels and disease risk. Nat Genet (2026).


Note

The provided scripts are not designed to work out of the box, as analyses were performed across two different computational environments: the Genes & Health Trusted Research Environment (TRE) for analyses involving individual-level data, and an HPC cluster (BIH, Berlin) for downstream annotation steps. Scripts are intended to illustrate the main analytical steps used to generate the results reported in the manuscript. All file paths have been replaced with <path_to_file>. Data access is available on application via genesandhealth.org.


Project structure

Script Description
01 — Data preparation
01_data_prep/01_phenotype_prep_GWAS.R Import batch-corrected Seer XT protein group intensities; match to GSA genotyping IDs and covariates; apply inverse-normal transformation; write REGENIE phenotype and covariate files for the common-variant GWAS (TRE)
01_data_prep/02_phenotype_prep_ExWAS.R Map WES sample IDs; apply inverse-normal transformation; write REGENIE phenotype and covariate files for the exome-wide association study (TRE)
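Both data preparation scripts apply an inverse-normal transformation before writing REGENIE phenotype files. The repository's scripts are written in R; the Python sketch below only illustrates the general rank-based technique. The Blom offset of 3/8, the average-rank handling of ties, and the function name are assumptions, not details taken from the scripts.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=3 / 8):
    """Rank-based inverse-normal transform (assumes no missing values).

    Tied values receive their average rank. The offset of 3/8 (Blom)
    is an assumption made for this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # Map ranks to quantiles strictly inside (0, 1), then to normal scores.
    return [nd.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]


intensities = [5.1, 2.0, 9.3, 7.7, 3.3]  # made-up protein group intensities
z_scores = inverse_normal_transform(intensities)
```

The transformed values keep the ordering of the input and are symmetric around zero, so the median intensity maps to a score of 0.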
02 — Cross-platform proteomics comparison
02_cross_platform_correlation/01_protein_mapping_post_processing_HPC.R Map protein targets across Seer XT, Olink HT, and SomaLogic 11k by UniProt ID; annotate with genomic coordinates, HPA tissue/cell-type expression, secretome classification; predict platform coverage from biophysical properties using LightGBM classifiers; compute protein group correlation clusters (HPC)
02_cross_platform_correlation/02_observational_correlations_TRE.R Compute Spearman correlations between platforms at protein group and peptide level; derive LOD estimates; compute CVs from repeated samples; run Boruta feature selection and Random Forest models to predict cross-platform concordance (TRE)
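The observational comparison relies on Spearman correlations between platforms and on coefficients of variation (CVs) from repeated samples. As a minimal Python sketch of those two quantities (the repository computes them in R), the following assumes complete paired measurements and a CV computed on the untransformed scale; all function names are hypothetical.

```python
import math
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def coefficient_of_variation(replicates):
    """CV of repeated measurements: sample SD divided by the mean."""
    return stdev(replicates) / mean(replicates)


# Made-up values for one protein measured on two platforms.
rho = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 40.0, 80.0])
cv = coefficient_of_variation([9.0, 10.0, 11.0])
```

Because Spearman correlation depends only on ranks, it is unaffected by the different intensity scales of the platforms being compared.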
03 — Genome-wide association study
03_GWAS/01_regenie_GWAS_array_qsub.sh SGE array job for REGENIE v3.4.1 two-step GWAS of 5,768 protein groups against TOPMed r3-imputed genotypes (51k GSA array samples, TRE)
04 — Exome-wide association study
04_ExWAS/01_regenie_ExWAS_qsub.sh SGE array job for REGENIE v3.4.1 single-variant ExWAS and gene-based burden tests against 55k WES call set (TRE)
05 — Fine-mapping and annotation
05_fine_mapping/01_collate_results_TRE.R Collate genome-wide significant GWAS and ExWAS signals; define merged fine-mapping regions; test for non-additive genetic effects; run peptide-level validation; look up pQTLs in Olink HT and SomaLogic 11k summary statistics (TRE)
05_fine_mapping/02_annotate_results_HPC.R Compile LD proxies; assign novelty tiers against 17 published pQTL studies (including liftOver to GRCh38); score candidate effector genes using eQTL PIP, ABC, Hi-C, protein complexes, ligand-receptor pairs, and proximity; compute pathway enrichment per LD-clump (HPC)
05_fine_mapping/03_submit_finemapping.sh Shell loop to run SuSiE fine-mapping per protein–region pair (TRE)
05_fine_mapping/04_fine_mapping_fused.R Run SuSiE RSS fine-mapping per region using fused GWAS + ExWAS summary statistics and a combined LD matrix from unrelated G&H participants; export 95% credible sets (TRE)
05_fine_mapping/05_submit_LD.sh Shell loop to add LD R² values to fine-mapped output (TRE)
05_fine_mapping/06_LD_proxies_fine.R Compute R² between all fine-mapped variants and each lead credible-set variant; export LD proxy lists (TRE)
05_fine_mapping/07_vep_annotation_imputed.sh Annotate imputed credible-set variants using BCFtools split-vep against a pre-annotated imputed VCF (TRE)
05_fine_mapping/08_vep_annotation_WES.sh Annotate WES credible-set variants using BCFtools split-vep against a pre-annotated WES VCF (TRE)
05_fine_mapping/09_look_up_Olink.sh Look up Seer pQTL lead variants in Olink HT GWAS/ExWAS summary statistics (TRE)
05_fine_mapping/10_look_up_SomaLogic.sh Look up Seer pQTL lead variants in SomaLogic 11k GWAS/ExWAS summary statistics (TRE)
05_fine_mapping/11_pQTL_lookup_Seer.sh Look up all Seer pQTL lead variants across all Seer protein GWAS/ExWAS summary statistics for cross-protein specificity (TRE)
05_fine_mapping/get_region_dosage.sh Extract imputed genotype dosages for a given variant list using PLINK2 (TRE)
05_fine_mapping/get_region_wes.sh Extract WES hard-call genotypes for a given variant list using PLINK2 (TRE)
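Two ideas recur in the fine-mapping scripts listed above: a 95% credible set per signal and R² between fine-mapped variants and the lead credible-set variant. The Python sketch below illustrates both in their simplest form and is not the repository's R code. It assumes per-variant posterior probabilities for a single signal are already available, and it omits the purity filtering that susieR applies; the function names and example values are hypothetical.

```python
import math


def credible_set(posterior, coverage=0.95):
    """Smallest set of variants whose summed posterior reaches `coverage`.

    `posterior` maps variant ID -> posterior probability for ONE signal.
    """
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for variant, prob in ranked:
        selected.append(variant)
        cumulative += prob
        if cumulative >= coverage:
            break
    return selected


def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two genotype dosage vectors.

    Both variants must be polymorphic in the sample, otherwise the
    denominator is zero.
    """
    n = len(dosage_a)
    ma, mb = sum(dosage_a) / n, sum(dosage_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(dosage_a, dosage_b))
    var_a = sum((a - ma) ** 2 for a in dosage_a)
    var_b = sum((b - mb) ** 2 for b in dosage_b)
    return cov ** 2 / (var_a * var_b)


# Made-up posterior probabilities for one signal.
cs = credible_set({"v1": 0.70, "v2": 0.20, "v3": 0.06, "v4": 0.04})
# Made-up dosages for a lead variant and a candidate proxy.
r2 = ld_r2([0, 1, 2, 1], [2, 1, 0, 1])
```

Since R² is a squared correlation, a proxy whose alleles are coded in the opposite direction to the lead variant still reaches an R² of 1.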
06 — Colocalisation with disease endpoints
06_phenotype_coloc/01_prepare_input_collate_results.R Identify pQTLs with evidence in FinnGen R12/UKB meta-analysis; prepare input list for colocalisation; import and filter coloc results (PP.H4 ≥ 0.8, TRE/HPC)
06_phenotype_coloc/02_lookup_disease_variants.sh SLURM array job to extract pQTL proxy variants from 402 FinnGen/UKB disease meta-analysis summary statistics (HPC)
06_phenotype_coloc/03_submit_coloc.sh SLURM array job to run coloc.signals() per protein–disease–region combination (HPC)
06_phenotype_coloc/04_run_coloc.R Run Bayesian colocalisation (coloc, p12 = 5×10⁻⁶) for a single protein–disease–region; generate locus-compare plots for strong colocalisations (HPC)
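The colocalisation step reports posterior probabilities for five hypotheses and keeps results with PP.H4 ≥ 0.8. As a hedged illustration of how such posteriors arise from per-variant log approximate Bayes factors under a single-causal-variant model, the Python sketch below combines two traits in log space. It is not the repository's R code: p12 = 5×10⁻⁶ comes from the description above, whereas p1 = p2 = 10⁻⁴ are the coloc package defaults and are assumed here. Function names and example values are hypothetical.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def logdiffexp(a, b):
    """log(exp(a) - exp(b)); requires a > b."""
    return a + math.log1p(-math.exp(b - a))


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=5e-6):
    """Posterior probabilities of H0..H4 from per-variant log ABFs.

    labf1 and labf2 must cover the same variants in the same order,
    and the region must contain at least two variants.
    p1 and p2 are assumed defaults, not values stated in the README.
    """
    lsum = [a + b for a, b in zip(labf1, labf2)]
    l1, l2, l12 = logsumexp(labf1), logsumexp(labf2), logsumexp(lsum)
    log_h = [
        0.0,                                                     # H0: no association
        math.log(p1) + l1,                                       # H1: trait 1 only
        math.log(p2) + l2,                                       # H2: trait 2 only
        math.log(p1) + math.log(p2) + logdiffexp(l1 + l2, l12),  # H3: distinct variants
        math.log(p12) + l12,                                     # H4: shared variant
    ]
    denom = logsumexp(log_h)
    return {f"PP.H{i}": math.exp(v - denom) for i, v in enumerate(log_h)}


# Made-up log ABFs for three variants in one region.
shared = coloc_posteriors([0.0, 20.0, 0.0], [0.0, 18.0, 0.0])
distinct = coloc_posteriors([20.0, 0.0, 0.0], [0.0, 0.0, 18.0])
strong = shared["PP.H4"] >= 0.8  # the filter threshold named above
```

In the first example both traits peak at the same variant, so nearly all posterior mass falls on H4; in the second the peaks sit on different variants and H3 dominates.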
07 — GWAS Catalog overlap
07_gwas_catalog/01_compile_overlap.R Intersect pQTL lead variants and LD proxies (r²≥0.6) with the NHGRI-EBI GWAS Catalog (proteomics studies excluded); assign EFO parent terms (HPC)
08 — Phenotype consolidation
08_phenotype_consolidation/01_consolidate_evidence.R Integrate colocalisation results with GWAS Catalog hits, OMIM Morbid Map, MGI mouse phenotypes, rare-variant burden associations (Jurgens et al.), Open Targets drug annotations, PharmGKB pharmacogenomics, and HPA tissue/cell-type expression to build a consolidated evidence matrix (HPC)
Functions
functions/fast_classifier.R 10-fold cross-validated LightGBM binary classifier (tree.platform.bin.lgbm) and elastic net classifier (tree.platform.bin.glmnet) for platform coverage prediction
functions/tree_within_platform_corr.R 10-fold cross-validated Random Forest regression (tree.platform.cor) for predicting cross-platform Spearman correlation
functions/tree_model_binary.R 10-fold cross-validated Random Forest binary classifier (tree.platform.bin) for predicting pQTL replication across platforms
functions/tree_model_predictions.R Apply a trained Random Forest regression model to a new dataset and return predicted values
functions/tree_model_predictions_cross_validation.R Apply an ensemble of cross-validation fold models to a new dataset and return median predictions with IQR
functions/boruta_feature_selection.R Boruta wrapper (boruta.selection.dat) for feature selection on continuous or binary outcomes
functions/aa_composition.R Compute per-protein amino acid counts, percentages, and grouped physicochemical properties from UniProt sequences
functions/parse_fast_pI.R Parse isoelectric point predictions from IPC2 FASTA-format output
functions/plot_enrichment.R Plot pathway enrichment results from gprofiler2 with optional gene-overlap-based compression of redundant terms
functions/plot_locus_compare.R Stacked locus-zoom plot comparing pQTL and disease GWAS statistics, coloured by LD with the lead pQTL
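Of the helper functions above, the amino acid composition step is the simplest to illustrate. The Python sketch below is not a port of functions/aa_composition.R; it only shows the kind of output described (counts, percentages, and grouped properties). The three property groups, the handling of non-standard residues, and the function name are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical grouping; the repository's own groups may differ.
PROPERTY_GROUPS = {"acidic": "DE", "basic": "KRH", "aromatic": "FWY"}


def aa_composition(sequence):
    """Return (counts, percentages, grouped percentages) for one protein.

    Characters outside the 20 standard amino acids are ignored.
    """
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    percentages = {aa: 100 * counts[aa] / total for aa in AMINO_ACIDS}
    grouped = {
        name: sum(percentages[aa] for aa in members)
        for name, members in PROPERTY_GROUPS.items()
    }
    return counts, percentages, grouped


# Made-up four-residue sequence.
counts, percentages, grouped = aa_composition("ACDE")
```

Percentages rather than raw counts make proteins of different lengths comparable when they are used as model features.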

Software

Software | Version | Purpose
R | 4.3.x | All statistical analyses
REGENIE | 3.4.1 | Genome-wide and exome-wide association testing
PLINK2 | 2.00a | Genotype extraction for LD computation
BCFtools | 1.19 | VEP variant annotation extraction
SuSiE | 0.14.x (susieR) | Fine-mapping
coloc | 5.x | Bayesian colocalisation
LightGBM | (lightgbm) | Platform coverage classification

Citation

Pietzner M, Williamson A, Hunt KA, Koprulu M, Kohleick L, Demircan K, Genes & Health Research Team, Finer S, Carrasco Zanini J, van Heel DA, Langenberg C. Nanoparticle-enriched mass spectrometry proteomics in British South Asian individuals identifies mechanistic links between genetic variants, plasma protein levels and disease risk. Nat Genet (2026).
