PhenoSS: Phenotype semantic similarity-based approach for rare disease prediction and patient clustering

Introduction

PhenoSS is an algorithm that predicts rare diseases and clusters patients based on Human Phenotype Ontology (HPO) concepts. It uses a Gaussian copula: the marginal prevalence of each HPO term is modeled for each disease, and a multivariate normal distribution links the marginals together to account for correlations between terms. We use the OARD (Open Annotations for Rare Diseases) API to infer the frequency of HPO terms across a diverse range of rare diseases. PhenoSS can calculate the phenotype similarity between any two patients, for finding similar patients or for clustering, or between one patient and any candidate disease, for diagnosis support.
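To make the Gaussian copula idea concrete, here is a minimal, standard-library-only sketch for two HPO terms. It is an illustration and not the PhenoSS implementation: the function name `simulate_term_pair`, the prevalence values, and the correlation `rho` are all assumptions chosen for the example.

```python
import math
import random
from statistics import NormalDist


def simulate_term_pair(p1, p2, rho, n=20000, seed=0):
    """Sample presence/absence of two HPO terms under a Gaussian copula.

    p1, p2: marginal prevalence of each term in a disease (illustrative values).
    rho:    correlation of the latent bivariate normal (illustrative value).
    Returns (frequency of term 1, frequency of term 2, frequency of both).
    """
    rng = random.Random(seed)
    std = NormalDist()
    # A term counts as present when its latent normal variable falls
    # below the normal quantile of the term's marginal prevalence.
    t1, t2 = std.inv_cdf(p1), std.inv_cdf(p2)
    c1 = c2 = both = 0
    for _ in range(n):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = g1
        z2 = rho * g1 + math.sqrt(1 - rho ** 2) * g2  # correlated with z1
        a, b = z1 < t1, z2 < t2
        c1 += int(a)
        c2 += int(b)
        both += int(a and b)
    return c1 / n, c2 / n, both / n


f1, f2, joint = simulate_term_pair(0.6, 0.3, rho=0.7)
print(f"term 1: {f1:.3f}, term 2: {f2:.3f}, both: {joint:.3f}")
```

The marginal frequencies stay close to the requested prevalences, while a positive `rho` makes the two terms co-occur more often than the product of the marginals would suggest. That is the effect the multivariate normal link is meant to capture.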

The toolkit is implemented in Python.

Installation

1) Download the databases required by this application:

DIshIN (required for ssmpy)

curl -L -O http://labs.rd.ciencias.ulisboa.pt/dishin/hp202506.db.gz
gunzip -N hp202506.db.gz

(you will find "hp.db" in the local directory)
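As an optional sanity check after the download, the sketch below confirms that the file exists and opens as an SQLite database, then lists its tables. The helper name `list_sqlite_tables` is hypothetical, and no assumption is made about which tables DiShIn creates.

```python
import os
import sqlite3


def list_sqlite_tables(db_path):
    """Return the table names in an SQLite file; raise if the file is missing."""
    if not os.path.isfile(db_path):
        raise FileNotFoundError(f"{db_path} not found; re-run the download step")
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


# Only inspect hp.db when it is present in the current directory.
if os.path.isfile("hp.db"):
    print(list_sqlite_tables("hp.db"))
```

An empty list or an `sqlite3.DatabaseError` would suggest that the download or decompression did not complete.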

HPO-Disease Frequency

OARD

  • Nothing needs to be downloaded here; the database is queried at run time. (https://rare.cohd.io/)

MONDO

  • You may need this data to convert from OMIM to MONDO or from Orphanet to MONDO. However, we provide the conversion files directly (./doc/database/omim_conversion.json | orphanet_conversion.json); note that they may be outdated.
  • Alternatively, download the mondo-edges.tsv for conversion.

2) Code environment:

Install the required packages:

pip install pandas numpy requests ssmpy scipy

3) Run 'python processing_hpo_frequency.py' to generate ./doc/database/hpo_frequency.csv (one-time only)

Tutorial

In this repo, we present two separate pipelines:

a) Patient Clustering

b) Disease Prediction (PhenoSS)

Patient Clustering

Sample HPO data

The file hpo_list contains the synthetic data for three randomly generated patients labeled 0_10, 1_10, 2_10.

0_10    HP_0004370;HP_0000280;HP_0002835;HP_0005274;HP_0000158;HP_0011470;HP_0001417;HP_0001270;HP_0008872;HP_0002015;HP_0000750;HP_0000157;
1_10    HP_0000483;HP_0002307;HP_0001090;HP_0001572;HP_0002342;HP_0011343;HP_0008760;HP_0001061;HP_0001249;HP_0000574;HP_0001417;HP_0002020;HP_0012810;HP_0000540;HP_0001350;HP_0001270;HP_0002574;HP_0011231;HP_0000750;HP_0002155;HP_0000431;HP_0000718;
2_10    HP_0002194;HP_0001263;HP_0001684;HP_0001417;HP_0001270;HP_0001249;HP_0001670;HP_0001667;HP_0001629;HP_0001639;HP_0002474;HP_0010863;HP_0000750;
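The format above (a patient ID, whitespace, then semicolon-terminated HPO terms) can be read with a few lines of Python. This is a sketch for inspecting your own input, not code from the repository: `parse_hpo_lines` is a hypothetical helper, and the sample rows are shortened versions of the ones shown above.

```python
import re

# HPO identifiers in this file use an underscore followed by seven digits.
HPO_ID = re.compile(r"^HP_\d{7}$")


def parse_hpo_lines(lines):
    """Map each patient ID to its list of HPO terms, validating the ID format."""
    patients = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        patient_id, term_field = line.split(None, 1)
        # The trailing ';' leaves an empty last item, which is dropped here.
        terms = [t.strip() for t in term_field.split(";") if t.strip()]
        bad = [t for t in terms if not HPO_ID.match(t)]
        if bad:
            raise ValueError(f"malformed HPO terms for {patient_id}: {bad}")
        patients[patient_id] = terms
    return patients


sample = [
    "0_10\tHP_0004370;HP_0000280;HP_0002835;",
    "2_10\tHP_0002194;HP_0001263;HP_0001684;",
]
patients = parse_hpo_lines(sample)
print(patients["0_10"])
```

Validating each identifier catches terms that were split or truncated, for example by a line wrap in the middle of an ID.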

Similarity score calculation

Using the following command, we can calculate the similarity scores between patient 1_10 and each of the patients in the hpo_list. The first input argument is the input file that contains the patient IDs and the HPO terms.

python similarity_score.py -input_dir [YOUR INPUT DIRECTORY] -output_dir [YOUR OUTPUT DIRECTORY]

The output of the command can be found in the file 1_10_sim.

pat1 pat2 hpo1 hpo2 similarity
0_10 1_10 HP_0004370;HP_0000280;HP_0002835;... HP_0000483;HP_0002307;HP_0001090;... 3.74
0_10 2_10 HP_0004370;HP_0000280;HP_0002835;... HP_0002194;HP_0001263;HP_0001684;... 8.09
1_10 2_10 HP_0000483;HP_0002307;HP_0001090;... HP_0002194;HP_0001263;HP_0001684;... 2.45
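For finding similar patients, the pairwise table above can be reduced to each patient's closest match. The sketch below assumes whitespace-separated columns with the score in the last field, and it assumes that a higher score means greater similarity; both helper names are hypothetical.

```python
def parse_similarity_lines(lines):
    """Extract (pat1, pat2, score) triples, skipping the header line."""
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if fields:
            rows.append((fields[0], fields[1], float(fields[-1])))
    return rows


def nearest_neighbors(rows):
    """Map each patient to (most similar other patient, score)."""
    best = {}
    for pat1, pat2, score in rows:
        # Each pair is listed once, so update both directions.
        for a, b in ((pat1, pat2), (pat2, pat1)):
            if a not in best or score > best[a][1]:
                best[a] = (b, score)
    return best


lines = [
    "pat1 pat2 hpo1 hpo2 similarity",
    "0_10 1_10 HP_0004370;HP_0000280;HP_0002835;... HP_0000483;HP_0002307;HP_0001090;... 3.74",
    "0_10 2_10 HP_0004370;HP_0000280;HP_0002835;... HP_0002194;HP_0001263;HP_0001684;... 8.09",
    "1_10 2_10 HP_0000483;HP_0002307;HP_0001090;... HP_0002194;HP_0001263;HP_0001684;... 2.45",
]
neighbors = nearest_neighbors(parse_similarity_lines(lines))
print(neighbors)
```

The same triples could be placed in a symmetric matrix and passed to a clustering routine if a full grouping is needed rather than nearest neighbors.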

Disease prediction

PhenoSS extracts the diseases/phenotype frequencies from the Open Annotations for Rare Diseases (OARD) and Human Phenotype Ontology Databases. It takes in HPO terms of a list of patients and outputs the ranks of possible underlying diseases.

Below is a sample input file:

P1	HP_0012759;HP_0000750;HP_0100022;HP_0000707;
P2	HP_0001270;HP_0012758;HP_0002066;HP_0011443;
P3	HP_0012758;HP_0002167;HP_0012638;HP_0000707;
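If your phenotype data uses the colon form of HPO identifiers (for example HP:0001270), a small helper can produce lines in the layout shown above: a patient ID, a tab, then semicolon-terminated terms with underscores. `format_patient_line` is a hypothetical name, and the colon-format input is an assumption about your source data.

```python
def format_patient_line(patient_id, hpo_terms):
    """Build one input line: ID, tab, then HP_XXXXXXX terms each followed by ';'."""
    normalized = [t.strip().replace("HP:", "HP_") for t in hpo_terms if t.strip()]
    return patient_id + "\t" + "".join(term + ";" for term in normalized)


line = format_patient_line(
    "P1", ["HP:0012759", "HP:0000750", "HP_0100022", "HP:0000707"]
)
print(line)
```

Terms that already use the underscore form pass through unchanged.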

You can use "PhenoSS_Codebook.ipynb" if you prefer an interactive notebook in the browser. Otherwise, to run PhenoSS, use the following command:

bash run_phenoss.sh \
  --inputfile data/patient_hpos.tsv \
  --outputfile results/phenoss_output.tsv \
  --mode oard_first \
  --freq_assignment extrinsic_ic \
  --method Resnik \
  --hp_db_sqlite hp.db \
  --hpo_db_path ./doc/database/hpo_frequency.csv \
  --url https://rare.cohd.io/api \
  --dataset_id 2 \
  --gene_conversion \
  --gene_of_interest GATA2 \
  --gene_outfile results/gene_results.tsv

| Argument | Required | Default | Description |
| --- | --- | --- | --- |
| --inputfile | Yes | | Input file containing patient HPO phenotypes |
| --outputfile | Yes | | Output ranking file |
| --mode | No | oard_first | Candidate disease selection strategy (oard_only, oard_first, hpodb_first, hpodb_only) |
| --freq_assignment | No | extrinsic_ic | Frequency assignment method for HPO terms |
| --method | No | Resnik | Semantic similarity method |
| --hp_db_sqlite | No | hp.db | SQLite database for HPO ontology |
| --hpo_db_path | No | hpo_frequency.csv | HPO frequency table |
| --gene_conversion | No (flag) | Off | Convert diseases to genes in output. Note that diseases with unknown genes will be removed. |
| --url | No | https://rare.cohd.io/api | COHD API endpoint |
| --dataset_id | No | 2 | Dataset ID for COHD |
| --gene_of_interest | No | empty | Specific gene to evaluate |
| --gene_outfile | No | empty | Output file for gene results |
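When PhenoSS is run from a Python workflow, the command line can be assembled programmatically. The sketch below only builds and prints the command. The flag names come from the argument list above, while `build_phenoss_command` is a hypothetical wrapper, not part of the repository.

```python
import shlex


def build_phenoss_command(inputfile, outputfile, **options):
    """Assemble the run_phenoss.sh argument list from keyword options.

    A value of True emits a bare flag (e.g. --gene_conversion);
    None or False omits the option so the script default applies.
    """
    cmd = ["bash", "run_phenoss.sh",
           "--inputfile", inputfile,
           "--outputfile", outputfile]
    for name, value in options.items():
        flag = "--" + name
        if value is True:
            cmd.append(flag)
        elif value is not None and value is not False:
            cmd.extend([flag, str(value)])
    return cmd


cmd = build_phenoss_command(
    "data/patient_hpos.tsv",
    "results/phenoss_output.tsv",
    mode="oard_only",
    method="Resnik",
    gene_conversion=True,
)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.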

The results consist of a list of MONDO diseases with their rankings and are stored in the file specified by --outputfile.

Output Example

patient_id disease_mondo gene disease_id score rank
PATIENT_001 MONDO:0001234 GENE1 80001234 -0.941566671739919 1
PATIENT_001 MONDO:0001234 GENE2 80001234 -0.941566671739919 1
PATIENT_001 MONDO:0005678 NA 80005678 -0.941503249828281 3
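To pull the top-ranked candidates per patient out of a result file, the rank column can be used directly. This sketch assumes whitespace- or tab-separated columns named as in the example header and fields that contain no spaces; `top_ranked` is a hypothetical helper.

```python
def top_ranked(lines, max_rank=1):
    """Return, per patient, the (disease, gene, rank) rows with rank <= max_rank."""
    header = lines[0].split()
    results = {}
    for line in lines[1:]:
        if not line.strip():
            continue
        row = dict(zip(header, line.split()))
        rank = int(row["rank"])
        if rank <= max_rank:
            results.setdefault(row["patient_id"], []).append(
                (row["disease_mondo"], row["gene"], rank)
            )
    return results


example = [
    "patient_id disease_mondo gene disease_id score rank",
    "PATIENT_001 MONDO:0001234 GENE1 80001234 -0.941566671739919 1",
    "PATIENT_001 MONDO:0001234 GENE2 80001234 -0.941566671739919 1",
    "PATIENT_001 MONDO:0005678 NA 80005678 -0.941503249828281 3",
]
top = top_ranked(example, max_rank=1)
print(top)
```

In the example, one disease maps to two genes, so two rows share rank 1 and both are returned.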

License

PhenoSS is distributed under the MIT License by Wang Genomics Lab.
