
End to end sim for secure GWAS and PGS benchmarking. Uses 1000G based genotypes, with synthetic T2D and BMI traits from the PGS catalog. Compares baseline, differential privacy, and federated learning results. Data and PGS files not in repo, fetch from IGSR and the PGS Catalog. R code with bash/Slurm, and built to run on an HPC cluster. WIP.


mfjoel01/Genomic-Data-Privacy-Simulation


Genomic Data Privacy Simulation

This repository contains scripts and workflows for simulating genomic data and evaluating privacy-preserving techniques across multiple analysis stages. Pipelines are organized by numbered folders that follow the order of execution, from data preparation through method comparison.

Repository structure

  • 0_Prep/ – utilities for preparing simulation inputs and shared resources.
  • 00_HapGen2Simulation/ – configuration and helpers for running HapGen2-based genotype simulations.
  • 1_Phenotype/ – scripts for generating phenotypes from the simulated genotypes.
  • 2_Baseline/ – baseline association analyses used as reference points.
  • 3_DifferentialPrivacy/ – experiments applying differential privacy mechanisms to the analyses.
  • 4_FederatedLearning/ – materials related to federated learning workflows.
  • 5_MethodComparison/ – notebooks and scripts comparing results across approaches.
  • Scripts/ and R/ – shared helper scripts and R utilities used throughout the project.
  • R Libraries/ – installation location for the required R packages.

Data availability

No datasets are hosted in this repository. To reproduce the simulations, you must supply your own input data and configure the paths accordingly. The required data are listed in the Required Data section at the end of this README.

Quick start

  1. Clone the repository and create a working directory for intermediate outputs.
  2. Download the six required data files (see Required Data).
  3. Configure environment variables (e.g., reference data paths) expected by the shell drivers in Scripts/. These scripts orchestrate the numbered stages and can be adapted to your local toolchain (R, HapGen2, etc.).
  4. Review the configuration files within each stage directory and update them to point to your local input data.
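
As a rough illustration of step 3, the snippet below exports a few paths and checks that the input directories exist before any stage is launched. The variable names (`GDPS_WORK_DIR`, `GDPS_REF_DIR`, `GDPS_PGS_DIR`) and the default locations are hypothetical placeholders, not names taken from this repository; check the drivers in Scripts/ for the variables they actually read.

```shell
# Hypothetical variable names and default paths, for illustration only.
# Replace them with the variables the drivers in Scripts/ actually expect.
export GDPS_WORK_DIR="${GDPS_WORK_DIR:-$PWD/work}"      # intermediate outputs
export GDPS_REF_DIR="${GDPS_REF_DIR:-$PWD/reference}"   # 1000 Genomes inputs
export GDPS_PGS_DIR="${GDPS_PGS_DIR:-$PWD/pgs}"         # PGS Catalog scoring files

# Create the working directory for intermediate outputs (Quick start, step 1).
mkdir -p "$GDPS_WORK_DIR"

# Warn about input directories that are missing, so problems surface early.
missing=0
for dir in "$GDPS_REF_DIR" "$GDPS_PGS_DIR"; do
  if [ ! -d "$dir" ]; then
    echo "warning: input directory not found: $dir" >&2
    missing=$((missing + 1))
  fi
done
echo "missing input directories: $missing"
```

Using `${VAR:-default}` keeps any value already set in your shell or Slurm job script, so the same snippet can be sourced on a laptop and on a cluster.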

How to run

Review the numbered folders in order to understand the expected workflow. Each stage contains notes or scripts that describe required dependencies. After tailoring configuration variables to your environment:

  • To execute the full pipeline sequentially, use the consolidated driver:

    bash Scripts/run_stages.sh
  • To run only a range of stages, pass the first and last stage numbers:

    bash Scripts/run_stages.sh 2 4     # runs stages 2, 3, then 4
  • Individual stages can also be run with the dedicated helpers, for example:

    bash Scripts/0_prep_driver.sh      # prepare shared inputs
    bash Scripts/1_pheno_driver.sh     # derive phenotypes
    bash Scripts/2_baseline_driver.sh  # perform baseline analyses

    Adjust each script's configuration variables before running to point to your data locations.
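
The range form of the driver (`run_stages.sh 2 4`) can be pictured with the sketch below, which expands a first and last stage number into the inclusive list of stages. This is not the code in Scripts/run_stages.sh; `expand_stages` is a hypothetical helper, and the defaults of 0 and 5 are an assumption based on the numbered folders.

```shell
# Hypothetical helper: expand "first last" into an inclusive list of stages.
# The defaults (0 and 5) are assumed from the numbered stage folders.
expand_stages() {
  first="${1:-0}"
  last="${2:-5}"
  if [ "$first" -gt "$last" ]; then
    echo "error: first stage ($first) is greater than last stage ($last)" >&2
    return 1
  fi
  stages=""
  stage="$first"
  while [ "$stage" -le "$last" ]; do
    stages="${stages:+$stages }$stage"   # append, space-separated
    stage=$((stage + 1))
  done
  echo "$stages"
}

expand_stages 2 4   # prints: 2 3 4
```

A real driver would loop over this list and call the matching stage script, stopping at the first stage that fails so later stages never run on incomplete inputs.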

Outputs (high‑level)

  • Plots: Manhattan/QQ, ROC curves, binned BMI vs PGS, K‑sweep panels, effect/score comparisons, per‑person y=x, and run‑to‑run variability. See each stage’s output/graphs/.
  • Text: AUC/R² summaries, Bonferroni thresholds, DP selection summaries, sweep metrics (CSV). Global caches live under Data/global/.
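
For orientation, a Bonferroni threshold is the family-wise significance level divided by the number of tests. The sketch below computes it with `awk`; the function name `bonferroni_threshold`, the level of 0.05, and the count of 10000 variants are illustrative assumptions, not values taken from this project's outputs.

```shell
# Hypothetical helper: Bonferroni threshold = alpha / number of tests.
# alpha and the test count below are illustrative, not project values.
bonferroni_threshold() {
  awk -v alpha="$1" -v m="$2" 'BEGIN { printf "%.3e\n", alpha / m }'
}

bonferroni_threshold 0.05 10000   # prints: 5.000e-06
```

In an association scan, the number of tests is typically the number of variants tested, so the threshold tightens as more variants are included.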

Required Data

1) 1000 Genomes Phase 3 genotypes, GRCh37, chr22 (fetch from IGSR)

2) PGS Catalog scoring files, harmonized to GRCh37 (fetch from the PGS Catalog)
