Nanoparticle-enriched proteomics in British South Asian individuals identifies mechanistic links between genetic variants, plasma protein levels and disease risk
This repository contains the analysis code accompanying the paper published in Nature Genetics (2026).
Pietzner M, Williamson A, Hunt KA, Koprulu M, Kohleick L, Demircan K, Genes & Health Research Team, Finer S, Carrasco Zanini J, van Heel DA, Langenberg C. Nanoparticle-enriched mass spectrometry proteomics in British South Asian individuals identifies mechanistic links between genetic variants, plasma protein levels and disease risk. Nat Genet (2026).
The scripts are not designed to run out of the box, because the analyses were performed across two computational environments: the Genes & Health Trusted Research Environment (TRE) for analyses involving individual-level data, and an HPC cluster (BIH, Berlin) for downstream annotation steps. They are intended to illustrate the main analytical steps used to generate the results reported in the manuscript. All file paths have been replaced with `<path_to_file>`. Data access is available on application via genesandhealth.org.
| Script | Description |
|---|---|
| 01 — Data preparation | |
| `01_data_prep/01_phenotype_prep_GWAS.R` | Import batch-corrected Seer XT protein group intensities; match to GSA genotyping IDs and covariates; apply inverse-normal transformation; write REGENIE phenotype and covariate files for the common-variant GWAS (TRE) |
| `01_data_prep/02_phenotype_prep_ExWAS.R` | Map WES sample IDs; apply inverse-normal transformation; write REGENIE phenotype and covariate files for the exome-wide association study (TRE) |
| 02 — Cross-platform proteomics comparison | |
| `02_cross_platform_correlation/01_protein_mapping_post_processing_HPC.R` | Map protein targets across Seer XT, Olink HT, and SomaLogic 11k by UniProt ID; annotate with genomic coordinates, HPA tissue/cell-type expression, and secretome classification; predict platform coverage from biophysical properties using LightGBM classifiers; compute protein group correlation clusters (HPC) |
| `02_cross_platform_correlation/02_observational_correlations_TRE.R` | Compute Spearman correlations between platforms at protein group and peptide level; derive LOD estimates; compute CVs from repeated samples; run Boruta feature selection and Random Forest models to predict cross-platform concordance (TRE) |
| 03 — Genome-wide association study | |
| `03_GWAS/01_regenie_GWAS_array_qsub.sh` | SGE array job for REGENIE v3.4.1 two-step GWAS of 5,768 protein groups against TOPMed r3-imputed genotypes (51k GSA array samples, TRE) |
| 04 — Exome-wide association study | |
| `04_ExWAS/01_regenie_ExWAS_qsub.sh` | SGE array job for REGENIE v3.4.1 single-variant ExWAS and gene-based burden tests against 55k WES call set (TRE) |
| 05 — Fine-mapping and annotation | |
| `05_fine_mapping/01_collate_results_TRE.R` | Collate genome-wide significant GWAS and ExWAS signals; define merged fine-mapping regions; test for non-additive genetic effects; run peptide-level validation; look up pQTLs in Olink HT and SomaLogic 11k summary statistics (TRE) |
| `05_fine_mapping/02_annotate_results_HPC.R` | Compile LD proxies; assign novelty tiers against 17 published pQTL studies (including liftOver to GRCh38); score candidate effector genes using eQTL PIP, ABC, Hi-C, protein complexes, ligand-receptor pairs, and proximity; compute pathway enrichment per LD-clump (HPC) |
| `05_fine_mapping/03_submit_finemapping.sh` | Shell loop to run SuSiE fine-mapping per protein–region pair (TRE) |
| `05_fine_mapping/04_fine_mapping_fused.R` | Run SuSiE RSS fine-mapping per region using fused GWAS + ExWAS summary statistics and a combined LD matrix from unrelated G&H participants; export 95% credible sets (TRE) |
| `05_fine_mapping/05_submit_LD.sh` | Shell loop to add LD R² values to fine-mapped output (TRE) |
| `05_fine_mapping/06_LD_proxies_fine.R` | Compute R² between all fine-mapped variants and each lead credible-set variant; export LD proxy lists (TRE) |
| `05_fine_mapping/07_vep_annotation_imputed.sh` | Annotate imputed credible-set variants using BCFtools split-vep against a pre-annotated imputed VCF (TRE) |
| `05_fine_mapping/08_vep_annotation_WES.sh` | Annotate WES credible-set variants using BCFtools split-vep against a pre-annotated WES VCF (TRE) |
| `05_fine_mapping/09_look_up_Olink.sh` | Look up Seer pQTL lead variants in Olink HT GWAS/ExWAS summary statistics (TRE) |
| `05_fine_mapping/10_look_up_SomaLogic.sh` | Look up Seer pQTL lead variants in SomaLogic 11k GWAS/ExWAS summary statistics (TRE) |
| `05_fine_mapping/11_pQTL_lookup_Seer.sh` | Look up all Seer pQTL lead variants across all Seer protein GWAS/ExWAS summary statistics for cross-protein specificity (TRE) |
| `05_fine_mapping/get_region_dosage.sh` | Extract imputed genotype dosages for a given variant list using PLINK2 (TRE) |
| `05_fine_mapping/get_region_wes.sh` | Extract WES hard-call genotypes for a given variant list using PLINK2 (TRE) |
| 06 — Colocalisation with disease endpoints | |
| `06_phenotype_coloc/01_prepare_input_collate_results.R` | Identify pQTLs with evidence in FinnGen R12/UKB meta-analysis; prepare input list for colocalisation; import coloc results and filter at PP.H4 ≥ 0.8 (TRE/HPC) |
| `06_phenotype_coloc/02_lookup_disease_variants.sh` | SLURM array job to extract pQTL proxy variants from 402 FinnGen/UKB disease meta-analysis summary statistics (HPC) |
| `06_phenotype_coloc/03_submit_coloc.sh` | SLURM array job to run `coloc.signals()` per protein–disease–region combination (HPC) |
| `06_phenotype_coloc/04_run_coloc.R` | Run Bayesian colocalisation (coloc, p12 = 5×10⁻⁶) for a single protein–disease–region combination; generate locus-compare plots for strong colocalisations (HPC) |
| 07 — GWAS Catalog overlap | |
| `07_gwas_catalog/01_compile_overlap.R` | Intersect pQTL lead variants and LD proxies (r² ≥ 0.6) with the NHGRI-EBI GWAS Catalog (proteomics studies excluded); assign EFO parent terms (HPC) |
| 08 — Phenotype consolidation | |
| `08_phenotype_consolidation/01_consolidate_evidence.R` | Integrate colocalisation results with GWAS Catalog hits, OMIM Morbid Map, MGI mouse phenotypes, rare-variant burden associations (Jurgens et al.), Open Targets drug annotations, PharmGKB pharmacogenomics, and HPA tissue/cell-type expression to build a consolidated evidence matrix (HPC) |
| Functions | |
| `functions/fast_classifier.R` | 10-fold cross-validated LightGBM binary classifier (`tree.platform.bin.lgbm`) and elastic net classifier (`tree.platform.bin.glmnet`) for platform coverage prediction |
| `functions/tree_within_platform_corr.R` | 10-fold cross-validated Random Forest regression (`tree.platform.cor`) for predicting cross-platform Spearman correlation |
| `functions/tree_model_binary.R` | 10-fold cross-validated Random Forest binary classifier (`tree.platform.bin`) for predicting pQTL replication across platforms |
| `functions/tree_model_predictions.R` | Apply a trained Random Forest regression model to a new dataset and return predicted values |
| `functions/tree_model_predictions_cross_validation.R` | Apply an ensemble of cross-validation fold models to a new dataset and return median predictions with IQR |
| `functions/boruta_feature_selection.R` | Boruta wrapper (`boruta.selection.dat`) for feature selection on continuous or binary outcomes |
| `functions/aa_composition.R` | Compute per-protein amino acid counts, percentages, and grouped physicochemical properties from UniProt sequences |
| `functions/parse_fast_pI.R` | Parse isoelectric point predictions from IPC2 FASTA-format output |
| `functions/plot_enrichment.R` | Plot pathway enrichment results from gprofiler2 with optional gene-overlap-based compression of redundant terms |
| `functions/plot_locus_compare.R` | Stacked locus-zoom plot comparing pQTL and disease GWAS statistics, coloured by LD with the lead pQTL |
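
The phenotype preparation scripts in `01_data_prep/` apply an inverse-normal transformation before the REGENIE phenotype files are written. As a rough illustration of that step only, a rank-based version can be sketched in base R as below. The function name, the Blom-type offset of 3/8, the averaging of ties, and the toy intensities are all assumptions made for this sketch; the scripts in this repository may differ in each of these details.

```r
# Rank-based inverse-normal transformation (illustrative sketch).
# Assumptions: Blom-type offset (c = 3/8) and average ranks for ties;
# neither choice is stated in this README.
inverse_normal_transform <- function(x, c = 3 / 8) {
  n <- sum(!is.na(x))                                      # observed values only
  r <- rank(x, na.last = "keep", ties.method = "average")  # missing values stay NA
  qnorm((r - c) / (n - 2 * c + 1))                         # map ranks to normal quantiles
}

# Hypothetical intensities for a single protein group
intensities <- c(12.1, 15.3, NA, 9.8, 20.4, 13.7)
z <- inverse_normal_transform(intensities)
```

The transformed values keep the rank order of the input, are symmetric around zero, and leave missing measurements as `NA`, so samples without a measurement for a given protein group can still be dropped when the phenotype file is assembled.
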
| Software | Version | Purpose |
|---|---|---|
| R | 4.3.x | All statistical analyses |
| REGENIE | 3.4.1 | Genome-wide and exome-wide association testing |
| PLINK2 | 2.00a | Genotype extraction for LD computation |
| BCFtools | 1.19 | VEP variant annotation extraction |
| SuSiE | 0.14.x (susieR) | Fine-mapping |
| coloc | 5.x | Bayesian colocalisation |
| LightGBM | (lightgbm) | Platform coverage classification |
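
PLINK2 is listed above for genotype extraction ahead of LD computation, and `05_fine_mapping/06_LD_proxies_fine.R` computes R² between fine-mapped variants and each lead credible-set variant. The sketch below shows one common way to obtain such values once dosages are loaded into R: the squared Pearson correlation between dosage vectors. The function name, variant IDs, and toy dosages are hypothetical, and the 0.6 cut-off is borrowed from the GWAS Catalog overlap step rather than taken from the LD proxy script itself.

```r
# Squared Pearson correlation (r^2) between a lead variant and all variants
# in a samples x variants dosage matrix (illustrative sketch).
# Assumption: dosages are on the 0-2 scale with variant IDs as column names.
ld_r2_with_lead <- function(dosage, lead_id) {
  r <- cor(dosage, dosage[, lead_id], use = "pairwise.complete.obs")
  setNames(as.numeric(r)^2, colnames(dosage))
}

# Hypothetical dosages for eight samples and three variants
dosage <- cbind(
  var_lead     = c(0, 1, 2, 0, 1, 2, 0, 1),
  var_proxy    = c(0, 1, 2, 0, 1, 2, 1, 0),
  var_unlinked = c(1, 1, 0, 2, 0, 1, 2, 0)
)

r2 <- ld_r2_with_lead(dosage, "var_lead")
proxies <- names(r2)[r2 >= 0.6]  # threshold assumed for illustration
```

Pairwise-complete observations are used so that a missing dosage in one variant does not remove the sample from every other comparison.
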