Nanoparticle-enriched proteomics in British South Asian individuals identifies mechanistic links between genetic variants, plasma protein levels and disease risk

This repository contains the analysis code accompanying the paper published in Nature Genetics (2026).

Pietzner M, Williamson A, Hunt KA, Koprulu M, Kohleick L, Demircan K, Genes & Health Research Team, Finer S, Carrasco Zanini J, van Heel DA, Langenberg C. Nanoparticle-enriched mass spectrometry proteomics in British South Asian individuals identifies mechanistic links between genetic variants, plasma protein levels and disease risk. Nat Genet (2026).

Note

The scripts are not designed to run out of the box, because the analyses were performed in two separate computational environments: the Genes & Health Trusted Research Environment (TRE) for analyses involving individual-level data, and an HPC cluster (BIH, Berlin) for downstream annotation steps. They are intended to illustrate the main analytical steps used to generate the results reported in the manuscript. All file paths have been replaced with the placeholder <path_to_file>. Access to the data can be requested via genesandhealth.org.
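Because every path is a placeholder, it can help to list where paths still need to be filled in before adapting a script. The following is a minimal Python sketch, not part of the repository; the function names `placeholder_lines` and `scan_repository` and the choice of `.R` and `.sh` suffixes are assumptions made for illustration.

```python
from pathlib import Path

# Placeholder string used throughout the scripts (taken from the note above).
PLACEHOLDER = "<path_to_file>"


def placeholder_lines(text, placeholder=PLACEHOLDER):
    """Return the 1-based line numbers of `text` that contain the placeholder."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if placeholder in line
    ]


def scan_repository(root, suffixes=(".R", ".sh")):
    """Map each script under `root` to the lines that still hold the placeholder.

    The suffix filter is an assumption: it covers the R and shell scripts
    listed in the project structure.
    """
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits = placeholder_lines(text)
            if hits:
                report[str(path)] = hits
    return report
```

Calling `scan_repository(".")` from the repository root would return a dictionary of script paths and line numbers to review.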

Project structure

Script	Description
01 — Data preparation
`01_data_prep/01_phenotype_prep_GWAS.R`	Import batch-corrected Seer XT protein group intensities; match to GSA genotyping IDs and covariates; apply inverse-normal transformation; write REGENIE phenotype and covariate files for the common-variant GWAS (TRE)
`01_data_prep/02_phenotype_prep_ExWAS.R`	Map WES sample IDs; apply inverse-normal transformation; write REGENIE phenotype and covariate files for the exome-wide association study (TRE)
02 — Cross-platform proteomics comparison
`02_cross_platform_correlation/01_protein_mapping_post_processing_HPC.R`	Map protein targets across Seer XT, Olink HT, and SomaLogic 11k by UniProt ID; annotate with genomic coordinates, HPA tissue/cell-type expression, and secretome classification; predict platform coverage from biophysical properties using LightGBM classifiers; compute protein group correlation clusters (HPC)
`02_cross_platform_correlation/02_observational_correlations_TRE.R`	Compute Spearman correlations between platforms at protein group and peptide level; derive LOD estimates; compute CVs from repeated samples; run Boruta feature selection and Random Forest models to predict cross-platform concordance (TRE)
03 — Genome-wide association study
`03_GWAS/01_regenie_GWAS_array_qsub.sh`	SGE array job for REGENIE v3.4.1 two-step GWAS of 5,768 protein groups against TOPMed r3-imputed genotypes (51k GSA array samples, TRE)
04 — Exome-wide association study
`04_ExWAS/01_regenie_ExWAS_qsub.sh`	SGE array job for REGENIE v3.4.1 single-variant ExWAS and gene-based burden tests against 55k WES call set (TRE)
05 — Fine-mapping and annotation
`05_fine_mapping/01_collate_results_TRE.R`	Collate genome-wide significant GWAS and ExWAS signals; define merged fine-mapping regions; test for non-additive genetic effects; run peptide-level validation; look up pQTLs in Olink HT and SomaLogic 11k summary statistics (TRE)
`05_fine_mapping/02_annotate_results_HPC.R`	Compile LD proxies; assign novelty tiers against 17 published pQTL studies (including liftOver to GRCh38); score candidate effector genes using eQTL PIP, ABC, Hi-C, protein complexes, ligand-receptor pairs, and proximity; compute pathway enrichment per LD-clump (HPC)
`05_fine_mapping/03_submit_finemapping.sh`	Shell loop to run SuSiE fine-mapping per protein–region pair (TRE)
`05_fine_mapping/04_fine_mapping_fused.R`	Run SuSiE RSS fine-mapping per region using fused GWAS + ExWAS summary statistics and a combined LD matrix from unrelated G&H participants; export 95% credible sets (TRE)
`05_fine_mapping/05_submit_LD.sh`	Shell loop to add LD R² values to fine-mapped output (TRE)
`05_fine_mapping/06_LD_proxies_fine.R`	Compute R² between all fine-mapped variants and each lead credible-set variant; export LD proxy lists (TRE)
`05_fine_mapping/07_vep_annotation_imputed.sh`	Annotate imputed credible-set variants using BCFtools split-vep against a pre-annotated imputed VCF (TRE)
`05_fine_mapping/08_vep_annotation_WES.sh`	Annotate WES credible-set variants using BCFtools split-vep against a pre-annotated WES VCF (TRE)
`05_fine_mapping/09_look_up_Olink.sh`	Look up Seer pQTL lead variants in Olink HT GWAS/ExWAS summary statistics (TRE)
`05_fine_mapping/10_look_up_SomaLogic.sh`	Look up Seer pQTL lead variants in SomaLogic 11k GWAS/ExWAS summary statistics (TRE)
`05_fine_mapping/11_pQTL_lookup_Seer.sh`	Look up all Seer pQTL lead variants across all Seer protein GWAS/ExWAS summary statistics for cross-protein specificity (TRE)
`05_fine_mapping/get_region_dosage.sh`	Extract imputed genotype dosages for a given variant list using PLINK2 (TRE)
`05_fine_mapping/get_region_wes.sh`	Extract WES hard-call genotypes for a given variant list using PLINK2 (TRE)
06 — Colocalisation with disease endpoints
`06_phenotype_coloc/01_prepare_input_collate_results.R`	Identify pQTLs with evidence in FinnGen R12/UKB meta-analysis; prepare input list for colocalisation; import coloc results and filter them at PP.H4 ≥ 0.8 (TRE/HPC)
`06_phenotype_coloc/02_lookup_disease_variants.sh`	SLURM array job to extract pQTL proxy variants from 402 FinnGen/UKB disease meta-analysis summary statistics (HPC)
`06_phenotype_coloc/03_submit_coloc.sh`	SLURM array job to run `coloc.signals()` per protein–disease–region combination (HPC)
`06_phenotype_coloc/04_run_coloc.R`	Run Bayesian colocalisation (`coloc`, p12 = 5×10⁻⁶) for a single protein–disease–region; generate locus-compare plots for strong colocalisations (HPC)
07 — GWAS Catalog overlap
`07_gwas_catalog/01_compile_overlap.R`	Intersect pQTL lead variants and LD proxies (r² ≥ 0.6) with the NHGRI-EBI GWAS Catalog (proteomics studies excluded); assign EFO parent terms (HPC)
08 — Phenotype consolidation
`08_phenotype_consolidation/01_consolidate_evidence.R`	Integrate colocalisation results with GWAS Catalog hits, OMIM Morbid Map, MGI mouse phenotypes, rare-variant burden associations (Jurgens et al.), Open Targets drug annotations, PharmGKB pharmacogenomics, and HPA tissue/cell-type expression to build a consolidated evidence matrix (HPC)
Functions
`functions/fast_classifier.R`	10-fold cross-validated LightGBM binary classifier (`tree.platform.bin.lgbm`) and elastic net classifier (`tree.platform.bin.glmnet`) for platform coverage prediction
`functions/tree_within_platform_corr.R`	10-fold cross-validated Random Forest regression (`tree.platform.cor`) for predicting cross-platform Spearman correlation
`functions/tree_model_binary.R`	10-fold cross-validated Random Forest binary classifier (`tree.platform.bin`) for predicting pQTL replication across platforms
`functions/tree_model_predictions.R`	Apply a trained Random Forest regression model to a new dataset and return predicted values
`functions/tree_model_predictions_cross_validation.R`	Apply an ensemble of cross-validation fold models to a new dataset and return median predictions with IQR
`functions/boruta_feature_selection.R`	Boruta wrapper (`boruta.selection.dat`) for feature selection on continuous or binary outcomes
`functions/aa_composition.R`	Compute per-protein amino acid counts, percentages, and grouped physicochemical properties from UniProt sequences
`functions/parse_fast_pI.R`	Parse isoelectric point predictions from IPC2 FASTA-format output
`functions/plot_enrichment.R`	Plot pathway enrichment results from `gprofiler2` with optional gene-overlap-based compression of redundant terms
`functions/plot_locus_compare.R`	Stacked locus-zoom plot comparing pQTL and disease GWAS statistics, coloured by LD with the lead pQTL
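
The data preparation scripts listed above apply an inverse-normal transformation before writing the REGENIE phenotype files. As a rough illustration of that step, here is a Python sketch of a rank-based inverse-normal transformation. It is not the repository's R code: the function name, the Blom offset `c = 3/8`, the use of average ranks for ties, and the use of `None` for missing values are all assumptions, since the README does not state which variant was used.

```python
from statistics import NormalDist


def inverse_normal_transform(values, c=3.0 / 8.0):
    """Rank-based inverse-normal transformation of a list of numbers.

    Missing values are given as None and stay None in the output.
    The offset c = 3/8 (Blom) is an assumed default, not taken from the paper.
    """
    # Keep (value, original position) for observed entries only.
    observed = [(v, i) for i, v in enumerate(values) if v is not None]
    n = len(observed)
    out = [None] * len(values)
    if n == 0:
        return out

    observed.sort(key=lambda pair: pair[0])
    standard_normal = NormalDist()

    pos = 0
    while pos < n:
        # Find the end of the current run of tied values.
        end = pos
        while end + 1 < n and observed[end + 1][0] == observed[pos][0]:
            end += 1
        # Tied values share the average of their 1-based ranks.
        avg_rank = (pos + end) / 2.0 + 1.0
        quantile = (avg_rank - c) / (n - 2.0 * c + 1.0)
        z = standard_normal.inv_cdf(quantile)
        for k in range(pos, end + 1):
            out[observed[k][1]] = z
        pos = end + 1
    return out
```

The transformed values depend only on the ranks of the input, so skewed intensity distributions are mapped onto an approximately standard normal scale.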

Software

Software	Version	Purpose
R	4.3.x	All statistical analyses
REGENIE	3.4.1	Genome-wide and exome-wide association testing
PLINK2	2.00a	Genotype extraction for LD computation
BCFtools	1.19	VEP variant annotation extraction
SuSiE	0.14.x (`susieR`)	Fine-mapping
coloc	5.x	Bayesian colocalisation
LightGBM	(`lightgbm`)	Platform coverage classification
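
The fine-mapping step exports 95% credible sets. As a hedged illustration of what such a set is, the Python sketch below builds the smallest set of variants whose posterior probabilities for one fine-mapped effect sum to at least the requested coverage. It is a simplified stand-in, not the `susieR` implementation and not the repository's code; the function name `credible_set` and the example probabilities are made up for illustration.

```python
def credible_set(probs, coverage=0.95):
    """Return indices of the smallest variant set reaching `coverage`.

    `probs` holds the posterior probabilities of the variants in a region
    for a single effect (assumed to sum to roughly 1). Variants are added
    in order of decreasing probability until the running total reaches
    `coverage`. If the total never reaches it, all indices are returned.
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Sort variant indices by probability, largest first (ties keep input order).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = []
    total = 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= coverage:
            break
    return chosen


# Illustrative values only: the two strongest variants already cover 96%.
example = credible_set([0.02, 0.70, 0.01, 0.26, 0.01])
```

In this toy example the set contains the variants at positions 1 and 3. Fine-mapping software typically applies further checks, such as filtering sets by the LD among their variants, which this sketch omits.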

Citation

Pietzner M, Williamson A, Hunt KA, Koprulu M, Kohleick L, Demircan K, Genes & Health Research Team, Finer S, Carrasco Zanini J, van Heel DA, Langenberg C. Nanoparticle-enriched mass spectrometry proteomics in British South Asian individuals identifies mechanistic links between genetic variants, plasma protein levels and disease risk. Nat Genet (2026).
