This repository contains scripts and workflows for simulating genomic data and evaluating privacy-preserving techniques across multiple analysis stages. Pipelines are organized by numbered folders that follow the order of execution, from data preparation through method comparison.
- 0_Prep/ – utilities for preparing simulation inputs and shared resources.
- 00_HapGen2Simulation/ – configuration and helpers for running HapGen2-based genotype simulations.
- 1_Phenotype/ – scripts for generating phenotypes from the simulated genotypes.
- 2_Baseline/ – baseline association analyses used as reference points.
- 3_DifferentialPrivacy/ – experiments applying differential privacy mechanisms to the analyses.
- 4_FederatedLearning/ – materials related to federated learning workflows.
- 5_MethodComparison/ – notebooks and scripts comparing results across approaches.
- Scripts/ and R/ – shared helper scripts and R utilities used throughout the project.
- R Libraries/ – installation location of the required R packages.
No datasets are hosted in this repository. To reproduce the simulations you will need to provide your own input data and configure paths accordingly. More information and download links for needed data can be found at the end.
- Clone the repository and create a working directory for intermediate outputs.
- Download the six needed data files.
- Configure environment variables (e.g., reference data paths) expected by the shell drivers in Scripts/. These scripts orchestrate the numbered stages and can be adapted to your local toolchain (R, HapGen2, etc.).
- Review the configuration files within each stage directory and update them to point to your local input data.
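As a rough sketch of the download step, the three inputs that have direct links at the end of this README could be fetched with `curl` as shown here. The function name `fetch_1000g` and the `DRY_RUN` switch are hypothetical and not part of the repository; the genetic map and the two PGS files are not covered because only their landing pages are linked.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): fetch the three inputs
# that this README links to directly. URLs and the default destination are
# copied from the data section at the end of the README.
fetch_1000g() {
  local dest="${1:-00_HapGen2Simulation/input/1000G}"
  local ucsc="https://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3"
  local vcf="ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
  local panel="https://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/sample_info/integrated_call_samples_v3.20130502.ALL.panel"
  local url
  for url in "$ucsc/$vcf" "$ucsc/$vcf.tbi" "$panel"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Dry run: print the command instead of downloading anything.
      echo "curl -fL -o $dest/${url##*/} $url"
    else
      # -f fails on HTTP errors, -L follows redirects, -o sets the output path.
      mkdir -p "$dest" && curl -fL -o "$dest/${url##*/}" "$url"
    fi
  done
}
```

Running `DRY_RUN=1 fetch_1000g` prints the three commands without touching the network, which is a cheap way to check the destination paths first.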
Review the numbered folders in order to understand the expected workflow. Each stage contains notes or scripts that describe required dependencies. After tailoring configuration variables to your environment:
- To execute the full pipeline sequentially, use the consolidated driver:

      bash Scripts/run_stages.sh

- You can also restrict the run to a range of stages:

      bash Scripts/run_stages.sh 2 4 # runs stages 2, 3, then 4

- Individual stages can also be run with the dedicated helpers, for example:

      bash Scripts/0_prep_driver.sh # prepare shared inputs
      bash Scripts/1_pheno_driver.sh # derive phenotypes
      bash Scripts/2_baseline_driver.sh # perform baseline analyses
Adjust each script's configuration variables before running to point to your data locations.
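The `2 4` example suggests that two stage numbers are read as an inclusive range. A minimal sketch of that interpretation follows; the helper `expand_stages` is hypothetical, and the assumption that valid stages run from 0 to 5 is taken from the numbered folders, not from the internals of `run_stages.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): expand "START END"
# into the inclusive list of stage numbers, one per line.
# Assumes stages are numbered 0 through 5, matching the numbered folders.
expand_stages() {
  local start="$1"
  local end="${2:-$1}"   # a single argument selects just that stage
  local n i
  for n in "$start" "$end"; do
    case "$n" in
      [0-5]) ;;  # valid single-digit stage
      *) echo "stage numbers must be between 0 and 5" >&2; return 1 ;;
    esac
  done
  if [ "$start" -gt "$end" ]; then
    echo "start stage must not exceed end stage" >&2
    return 1
  fi
  i="$start"
  while [ "$i" -le "$end" ]; do
    echo "$i"
    i=$((i + 1))
  done
}
```

For instance, `expand_stages 2 4` prints 2, 3 and 4 on separate lines, while a reversed range such as `expand_stages 4 2` is rejected with a non-zero exit status.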
- Plots: Manhattan/QQ, ROC curves, binned BMI vs PGS, K-sweep panels, effect/score comparisons, per-person y=x, and run-to-run variability. See each stage's output/graphs/.
- Text: AUC/R² summaries, Bonferroni thresholds, DP selection summaries, sweep metrics (CSV). Global caches live under Data/global/.
- VCF: ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
  - Link: https://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
  - Place at: 00_HapGen2Simulation/input/1000G/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
- Index (.tbi): ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
  - Link: https://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
  - Place at: 00_HapGen2Simulation/input/1000G/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
- Sample panel (superpopulation labels for EUR subsetting): integrated_call_samples_v3.20130502.ALL.panel
  - Link: https://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/sample_info/integrated_call_samples_v3.20130502.ALL.panel
  - Place at: 00_HapGen2Simulation/input/1000G/integrated_call_samples_v3.20130502.ALL.panel
- Genetic map (GRCh37) for chr22: genetic_map_chr22_combined_b37.txt
  - Get it from the IMPUTE2 1000G Phase 3 archive: https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html
  - Place at: 00_HapGen2Simulation/input/1000G/genetic_map_chr22_combined_b37.txt
- IGSR Phase 3 overview: https://www.internationalgenome.org/data-portal/data-collection/phase-3
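The sample panel is what makes EUR subsetting possible. A minimal sketch of pulling the EUR sample IDs out of it is shown here, assuming the panel is tab-separated with a header row and the columns sample, pop, super_pop, gender. The function name `eur_samples` is hypothetical; how the repository's own scripts perform this step is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print the sample IDs
# whose super_pop column is EUR.
# Assumes a tab-separated panel with a header row and the columns
# sample, pop, super_pop, gender.
eur_samples() {
  local panel="$1"
  # NR > 1 skips the header; column 3 holds the superpopulation label.
  awk -F'\t' 'NR > 1 && $3 == "EUR" { print $1 }' "$panel"
}
```

A call such as `eur_samples 00_HapGen2Simulation/input/1000G/integrated_call_samples_v3.20130502.ALL.panel > eur_samples.txt` writes one ID per line (the output filename is only an example). A list in that form can be passed to tools that accept a samples file, for example `bcftools view -S`.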
- T2D, PGS003443 harmonized positions: PGS003443_hmPOS_GRCh37.txt.gz
  - Score page: https://www.pgscatalog.org/score/PGS003443/
  - Place at: Data/pgs/PGS003443_hmPOS_GRCh37.txt.gz
- BMI, PGS004994 harmonized positions: PGS004994_hmPOS_GRCh37.txt.gz
  - Score page: https://www.pgscatalog.org/score/PGS004994/
  - Place at: Data/pgs/PGS004994_hmPOS_GRCh37.txt.gz
- PGS Catalog download notes: https://www.pgscatalog.org/downloads/
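Before running any stage, it may help to confirm that all six files sit at the locations listed above. The following is a minimal sketch of such a check; the function name `check_inputs` is hypothetical and not part of the repository, and the paths are the "Place at" locations from this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): report which of the six
# expected input files are missing or empty under a repository root.
check_inputs() {
  local root="${1:-.}"
  local missing=0
  local f
  for f in \
    "00_HapGen2Simulation/input/1000G/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" \
    "00_HapGen2Simulation/input/1000G/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi" \
    "00_HapGen2Simulation/input/1000G/integrated_call_samples_v3.20130502.ALL.panel" \
    "00_HapGen2Simulation/input/1000G/genetic_map_chr22_combined_b37.txt" \
    "Data/pgs/PGS003443_hmPOS_GRCh37.txt.gz" \
    "Data/pgs/PGS004994_hmPOS_GRCh37.txt.gz"
  do
    # -s is true only for files that exist and are non-empty.
    if [ ! -s "$root/$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  # Exit status 0 only when every file is present.
  [ "$missing" -eq 0 ]
}
```

Called from the repository root as `check_inputs .`, it prints one `missing:` line per absent file and returns a non-zero status if any are absent, so it can gate a driver script.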