Shadi Zabad, Yue Li, Simon Gravel
McGill University
This repository contains code and scripts to reproduce all the analyses in our manuscript, which proposes
an ensemble learning technique for polygenic risk score (PRS) inference. The MoEPRS model takes as input a set of
pre-trained polygenic scores (inferred from various biobanks or strata) as well as individual-level data
for a validation dataset, and learns to assign each individual the scores that work best for them, in a personalized manner.
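Schematically, given K pre-trained scores and covariates for individual i, the model forms a convex combination of the scores, with mixing weights produced by a gating function of the covariates (this is a sketch of the idea; see the manuscript for the exact formulation):

$$
\hat{y}_i = \sum_{k=1}^{K} g_k(\mathbf{c}_i)\,\mathrm{PGS}_{ik}, \qquad g_k(\mathbf{c}_i) \geq 0, \qquad \sum_{k=1}^{K} g_k(\mathbf{c}_i) = 1,
$$

where $\mathrm{PGS}_{ik}$ is the $k$-th pre-trained score evaluated for individual $i$, and the gating weights $g_k$ can be parameterized, e.g., as a softmax over the covariates $\mathbf{c}_i$.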
To get an intuitive understanding of what the MoEPRS model is set up to do, please check the
notebooks directory, which includes simple examples with simulations and associated visualizations.
The code base is still in an experimental phase and needs some more work to streamline the installation. To get started with the current code base, clone the repository and then install the dependencies:

```bash
git clone https://github.com/shz9/moe-prs-paper.git
pip install -r moe-prs-paper/requirements.txt
```

Our analyses involve individual-level data that is not publicly available. However, if you have access to
the UK Biobank or CARTaGENE biobank, you should be able to replicate all the steps here. If you have access to
third-party datasets that you'd like to use the MoEPRS model on, you can modify some of the steps to
ensure that your data conforms to the same formats.
Fitting the MoEPRS model to validation data requires extracting and pre-processing the following three data
sources:
- Genotype and phenotype data for a large number of individuals (ideally > 1000): MoEPRS requires a validation dataset with paired genotype and phenotype data. For a phenotype of interest, say standing height, you need to extract the phenotype data and pre-process it before model fitting.
- Polygenic score weights: Once you decide on the phenotype of interest, you should also collect relevant stratified polygenic scores for that same phenotype. Here, you can use either the PGS Catalog or in-house polygenic score weights.
- Covariates: In our setup, the gating model takes as input individual attributes (i.e. covariates) and uses them to learn cohort partitions. These covariates can be things like the individual's age, sex, and genomic PCs. Note that we also correct for covariate effects globally to ensure that the gate does not pick up on spurious, non-genetic variation (one common recipe is sketched after this list).
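To make the covariate-correction step concrete, here is a minimal sketch of one common way to residualize a phenotype on covariates before model fitting. The pipeline scripts implement the exact procedure used in the paper; the file and column names below are illustrative placeholders:

```python
# Minimal sketch: regress the phenotype on covariates and keep the residuals,
# so the gate cannot pick up on covariate-driven, non-genetic variation.
# File and column names are illustrative placeholders.
import numpy as np
import pandas as pd

df = pd.read_csv("phenotype_and_covariates.csv")  # hypothetical merged table

covariates = df[["Age", "Sex", "PC1", "PC2"]].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(df)), covariates])  # add an intercept
y = df["Height"].to_numpy(dtype=float)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit of covariate effects
df["Height_resid"] = y - X @ beta             # covariate-corrected phenotype
```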
In our pipeline, the first step is to download the polygenic scores from the PGS Catalog, which can be done with the help of the scripts batch_download_pgs.py or download_pgs.py.
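If you prefer to fetch a scoring file yourself, a minimal sketch along the lines of what these scripts automate might look like the following. The URL layout mirrors the PGS Catalog's public FTP structure at the time of writing and may change; PGS000001 is just an example score ID:

```python
# Sketch: download a scoring file for a given PGS Catalog ID.
# The URL layout reflects the Catalog's public FTP structure and may change.
import urllib.request

def download_pgs_scoring_file(pgs_id: str, dest: str) -> None:
    url = (
        "https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/"
        f"{pgs_id}/ScoringFiles/{pgs_id}.txt.gz"
    )
    urllib.request.urlretrieve(url, dest)

download_pgs_scoring_file("PGS000001", "PGS000001.txt.gz")
```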
NOTE: If your polygenic score weights are not from the PGS Catalog, please ensure that they are formatted
according to the PGS Catalog scoring file format. This is important because we use the pgsc_calc software
utility from the PGS Catalog to perform scoring in the next steps.
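For reference, a PGS Catalog scoring file is a tab-separated table with # header comments and columns along these lines (illustrative excerpt; see the PGS Catalog documentation for the full specification):

```
#pgs_id=PGS000001
chr_name	chr_position	effect_allele	other_allele	effect_weight
1	123456	A	G	0.012
```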
Once these PGSs are downloaded and harmonized, the next step is to extract and prepare the individual-level data. Here, each biobank will be different, but if you have access to the UK/CARTaGENE Biobanks, you can follow the steps in these directories:

The scripts in these directories extract the necessary genotype data, phenotype data, and covariates.
To generate polygenic scores for individuals in the validation data, we are going to use the pgsc_calc software
utility from the PGS Catalog. The benefit of this framework is that it provides standardized ways to handle
genotype data in various formats (.bed, .pgen, etc.), and it also handles lifting over coordinates, if necessary.
To see how we used the utility for both the UK and CARTaGENE biobanks, refer to the scripts in these directories (a generic sketch of a pgsc_calc run follows the list):
- UK Biobank: Scripts for scoring with pgsc_calc are here.
- CARTaGENE Biobank: Scripts for scoring with pgsc_calc are here.
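For orientation, a pgsc_calc run roughly follows the pattern below: write a samplesheet describing the target genotype files, then launch the Nextflow pipeline. The samplesheet columns and pipeline flags follow the pgsc_calc documentation at the time of writing and may differ across versions, so treat this as a sketch rather than a verified recipe:

```python
# Sketch: build a pgsc_calc samplesheet and launch the pipeline via Nextflow.
# Samplesheet columns and pipeline flags may differ across pgsc_calc versions.
import csv
import subprocess

with open("samplesheet.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sampleset", "path_prefix", "chrom", "format"])
    # One row per genotype fileset; "pfile" = PLINK2 .pgen, "bfile" = PLINK1 .bed.
    writer.writerow(["validation", "genotypes/chr22", "22", "pfile"])

subprocess.run(
    [
        "nextflow", "run", "pgscatalog/pgsc_calc",
        "-profile", "docker",
        "--input", "samplesheet.csv",
        "--pgs_id", "PGS000001",      # example score ID
        "--target_build", "GRCh38",
    ],
    check=True,
)
```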
Once all the input data sources have been extracted or computed, the next step is to harmonize them into a single data frame. This can be done with the help of the utility script create_datasets.py. Note that this script, in its default settings, will ask you for the paths of: (1) the phenotype file, (2) the polygenic score files, and (3) the covariates file.
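Conceptually, the harmonized dataset is a single table keyed on individual ID, with one column per input. Here is a toy sketch of the merge; create_datasets.py handles the real file formats, and the file and column names below are illustrative:

```python
# Sketch: merge phenotype, per-individual PGS estimates, and covariates into
# one data frame keyed on individual ID. Names are illustrative placeholders.
import pandas as pd

phenotype = pd.read_csv("height_phenotype.csv")   # columns: IID, Height
pgs = pd.read_csv("aggregated_pgs.csv")           # columns: IID, PGS000001, ...
covariates = pd.read_csv("covariates.csv")        # columns: IID, Age, Sex, PC1, ...

dataset = phenotype.merge(pgs, on="IID").merge(covariates, on="IID")
dataset.to_csv("harmonized_dataset.csv", index=False)
```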
Once your datasets have been extracted and pre-processed, you can go ahead and experiment with fitting
the MoEPRS model or the baseline MultiPRS. Both of these models are implemented in:
- model/moe.py: Implements the MoEPRS model using the EM algorithm.
- model/baseline_models.py: Implements the baseline models, including MultiPRS.
To see an example of how both of these models are used from the Python API, see the notebooks directory.
To run these models in batch mode for a number of different phenotypes, you can use the train_models.py utility script.
NOTE: You can also explore the PRSDataset class, which implements a Python wrapper around the underlying
data frame and provides many utilities for pre-processing and working with PRS and phenotype data.
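As a rough illustration of the fitting workflow (the canonical usage lives in the notebooks; the import paths, class names, and method signatures below are assumptions about the API and may not match the code exactly):

```python
# Hypothetical usage sketch -- names and signatures are assumptions; consult
# the notebooks and model/moe.py for the actual API.
import pandas as pd
from model.moe import MoEPRS       # assumed import path
from prs_dataset import PRSDataset  # assumed import path

df = pd.read_csv("harmonized_dataset.csv")

# Wrap the harmonized data frame; PRSDataset exposes pre-processing utilities
# for PRS and phenotype data.
dataset = PRSDataset(df)            # assumed constructor

model = MoEPRS()                    # EM-based mixture-of-experts model
model.fit(dataset)                  # assumed fit interface
predictions = model.predict(dataset)  # assumed predict interface
```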
Once the ensemble PRS models have been trained, the next step is to evaluate their performance and examine the learned representations using a number of visualization and analysis tools.
For evaluation, you can consult the script evaluation/evaluate_predictive_performance.py, which provides utilities for computing stratified evaluation metrics.
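If you want a quick, self-contained version of the idea behind stratified evaluation, the snippet below computes a squared correlation between predicted and observed phenotype within strata of a covariate using plain pandas. The repository script implements the metrics actually reported in the paper; the names here are illustrative:

```python
# Sketch: stratified evaluation -- squared correlation between predicted and
# observed phenotype within strata of a covariate (e.g. sex).
import pandas as pd

df = pd.read_csv("predictions.csv")  # hypothetical: IID, Sex, Height, Height_pred

stratified_r2 = (
    df.groupby("Sex")
      .apply(lambda g: g["Height"].corr(g["Height_pred"]) ** 2)
)
print(stratified_r2)
```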
For visualization, you can consult the scripts in plotting/, which provide utilities for plotting the evaluation metrics as well as generating the PGS admixture graphs highlighted in the manuscript. If you're interested in experimenting with the PGS admixture graphs in particular, check the script plotting/plot_pgs_admixture.py.
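The PGS admixture graphs display, for each individual, the gating weights assigned to each expert PRS as a stacked bar, analogous to ADMIXTURE-style plots. Below is a minimal matplotlib sketch of this kind of figure using randomly generated weights; plotting/plot_pgs_admixture.py produces the versions shown in the paper:

```python
# Sketch: ADMIXTURE-style stacked bar chart of gating weights. Each column is
# an individual; bar segments are the weights assigned to each expert PRS.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
weights = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=50)  # 50 individuals x 3 PRSs
order = np.argsort(weights[:, 0])                        # sort for readability
weights = weights[order]

bottom = np.zeros(len(weights))
for k in range(weights.shape[1]):
    plt.bar(np.arange(len(weights)), weights[:, k], bottom=bottom,
            width=1.0, label=f"PGS {k + 1}")
    bottom += weights[:, k]

plt.xlabel("Individuals")
plt.ylabel("Gating weight")
plt.legend()
plt.show()
```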
@article{Zabad2025.09.15.676165,
author = {Zabad, Shadi and Li, Yue and Gravel, Simon},
title = {Personalized polygenic risk prediction and assessment with a Mixture-of-Experts framework},
elocation-id = {2025.09.15.676165},
year = {2025},
doi = {10.1101/2025.09.15.676165},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/09/17/2025.09.15.676165},
eprint = {https://www.biorxiv.org/content/early/2025/09/17/2025.09.15.676165.full.pdf},
journal = {bioRxiv}
}