mtrKG is a Metabolite Ratio Knowledge Graph project that integrates local rQTL outputs with public resources (GWAS Catalog, Open Targets, STRING, Ensembl Regulatory Build, Reactome, HMDB, Rhea), then supports SPARQL analysis, SHACL validation, and drug repurposing experiments.
- A schema and integration pipeline that builds RDF/Turtle knowledge graphs.
- Analysis queries (`.rq`) and notebook workflows for hypothesis-driven exploration.
- Validation shapes and reports (PySHACL).
- Optional ML-based link prediction for drug repurposing (PyKEEN/TransE).
- Ingest local rQTL JSON files into an RDF graph.
- Enrich the graph with external biomedical sources.
- Serialize the graph to Turtle (`output/mtrKG_01.ttl` / `output/mtrKG.ttl`).
- Load the graph into GraphDB (repository usually named `mtrKG`).
- Run SPARQL analysis from `analysis/*.rq`.
- Save query outputs as CSV files for downstream reporting.
- Validate graph quality with SHACL constraints.
| Path | Purpose | Key contents |
|---|---|---|
| `src/` | Core code and notebooks | Integrators, schema, SPARQL engines, build/analysis/validation notebooks |
| `analysis/` | SPARQL and rule assets | `002_query.rq` ... `014_query.rq`, `q10_query_construct.rq`, `rqtl_ruleset.pie`, generated `*.csv` |
| `data/` | Source datasets and DB artifacts | `json_files/` (rQTL JSONs), GWAS/HMDB exports, SQLite assets |
| `output/` | Generated outputs | KG Turtle files, integration logs/reports, drug predictions, validation outputs |
| `validation/` | SHACL shapes and reports | `rqtl_shapes.ttl`, `schema_validation.ttl`, validation reports |
| `doc/` | Documentation assets | Drawings, pathway visuals, exported HTML/PDF diagrams |
| `notebook/` | Auxiliary notebook area | Additional/legacy notebook content |
| `sandbox/` | Experimental work area | Scratch notebooks, prototypes, intermediate artifacts |
| Notebook | Purpose |
|---|---|
| `src/create_mtrKG.ipynb` | Main KG build and enrichment pipeline |
| `src/analyse_graph.ipynb` | Runs SPARQL analyses over GraphDB and exports query outputs |
| `src/validation.ipynb` | Runs PySHACL validation and writes reports |
| `src/utility.ipynb` | Utility analysis/visualization snippets |
| `src/notebook.ipynb`, `src/notebook_01.ipynb` | Broader exploratory/legacy workflows |
| Module | Responsibility |
|---|---|
| `src/schema_definition.py` | Declares namespaces, classes, and predicates; builds ontology/schema scaffold |
| `src/integrate_rQTLs.py` | Ingests local rQTL JSON and creates core ratio/variant/causal structures |
| `src/integrate_gwas_catalog.py` | Enriches SNPs with GWAS Catalog associations |
| `src/integrate_open_targets.py` | Adds target tractability, diseases, liabilities, and known drugs |
| `src/integrate_string.py` | Adds gene-gene interaction edges from STRING |
| `src/integrate_encode.py` | Adds SNP overlap with Ensembl regulatory/motif features |
| `src/integrate_reactome.py` | Adds gene/metabolite pathway context from Reactome |
| `src/integrate_HMDB.py` | Adds HMDB metabolite location knowledge |
| `src/integrate_rhea.py` | Adds reaction participation via Rhea SPARQL |
| `src/integrate_ewas.py` | Optional EWAS enrichment module |
| `src/graphdb_engine.py` | Executes SPARQL against GraphDB and returns DataFrames |
| `src/execute_sparql.py` | Executes SPARQL directly on local RDFLib graph |
| `src/predict_drugs.py` | Trains TransE embeddings and generates drug repurposing candidates |
- `data/json_files/`: local rQTL JSON inputs (5,095 files in this workspace snapshot).
- `data/hmdb_metabolites.xml`, `data/hmdb_proteins.xml`: HMDB bulk XML resources.
- `data/gwas-catalog-download-associations-alt-full.tsv`: GWAS association table.
- `data/Human-GEM.xml`: metabolic model resource.
- `data/instance/metabolite-ratio-app.sqlite`: local app database asset.
- `data/populate_db_with_jsons.py`: helper script for filling the SQLite app schema.
The repository includes a root requirements.txt that centralizes Python dependencies used by the main scripts and notebooks.
Included package groups:
- Core KG build/query stack: `rdflib`, `pandas`, `requests`, `SPARQLWrapper`, `urllib3`
- Validation workflow: `pyshacl`
- Drug repurposing workflow: `numpy`, `scipy`, `torch`, `pykeen`
- Notebook/visualization utilities: `tqdm`, `networkx`, `matplotlib`, `pyvis`, `qrcode[pil]`
Install with:

    pip install -r requirements.txt

Recommended entry point: `src/create_mtrKG.ipynb`.
The notebook pipeline applies integrations in this sequence:
- rQTL local JSON ingestion
- GWAS Catalog
- Open Targets
- STRING
- Ensembl Regulatory Build
- Reactome (genes, then metabolites)
- HMDB
- Rhea
- Serialize graph to Turtle
- Optional drug repurposing
Typical output files are written under output/ and output/integration/:
- `mtrKG.ttl` / `mtrKG_01.ttl`
- `*_integration.log`
- `*_mapping_report.csv`
- `drug_repurposing_predictions.csv`
Primary analysis notebook: src/analyse_graph.ipynb.
Queries are stored in analysis/*.rq and executed via query_graphdb() in src/graphdb_engine.py.
Current behavior:
- Reads each `.rq` file.
- Prints the exact SPARQL query text being executed.
- Executes the query on GraphDB (`repo_name="mtrKG"` by default).
- Saves SELECT results as CSV with the same base name:
  - `analysis/002_query.rq` -> `analysis/002_query.csv`
  - `analysis/013_query_count_triples.rq` -> `analysis/013_query_count_triples.csv`
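The read, print, execute, save loop can be approximated with the standard library alone. The sketch below is not the code in `src/analyse_graph.ipynb`: `run_queries` and the `execute` callback are hypothetical names. `execute` stands in for whatever sends the query to GraphDB (for example a thin wrapper around `query_graphdb()`, whose exact signature is not documented here) and is assumed to return a header list plus an iterable of rows.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical callback type: takes SPARQL text, returns (header, rows).
Executor = Callable[[str], Tuple[Sequence[str], Iterable[Sequence[str]]]]


def run_queries(query_dir: Path, execute: Executor) -> List[Path]:
    """Run every .rq file in query_dir and save each result next to it as CSV."""
    written = []
    for rq_path in sorted(query_dir.glob("*.rq")):
        query_text = rq_path.read_text(encoding="utf-8")
        print(f"--- {rq_path.name} ---")
        print(query_text)  # show the exact query text being executed

        header, rows = execute(query_text)

        # Same base name, different suffix: 002_query.rq -> 002_query.csv
        csv_path = rq_path.with_suffix(".csv")
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        written.append(csv_path)
    return written
```

This sketch treats every file as a SELECT query. A CONSTRUCT query such as `q10_query_construct.rq` returns a graph rather than rows and would need separate handling.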
GraphDB prerequisites:
- GraphDB is running locally.
- Repository exists (usually `mtrKG`).
- The generated Turtle graph is loaded into that repository.
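One way to check these prerequisites from Python is to ask the repository for its statement count. The sketch below assumes GraphDB's default address `http://localhost:7200` and the RDF4J-style `/repositories/<id>/size` endpoint that GraphDB exposes. The helper names `repository_url` and `repository_size` are hypothetical and not part of this repository.

```python
from urllib.parse import quote
from urllib.request import urlopen

GRAPHDB_URL = "http://localhost:7200"  # assumption: GraphDB default port


def repository_url(repo_name: str = "mtrKG", base_url: str = GRAPHDB_URL) -> str:
    """Build the repository URL, which also serves as the SPARQL endpoint."""
    return f"{base_url.rstrip('/')}/repositories/{quote(repo_name, safe='')}"


def repository_size(repo_name: str = "mtrKG", base_url: str = GRAPHDB_URL,
                    timeout: float = 5.0):
    """Return the number of statements in the repository, or None on failure."""
    try:
        with urlopen(f"{repository_url(repo_name, base_url)}/size",
                     timeout=timeout) as response:
            return int(response.read().decode("utf-8").strip())
    except (OSError, ValueError):
        # OSError covers connection errors, timeouts, and HTTP errors (e.g. 404).
        return None
```

A result of `None` suggests GraphDB is not running or the repository does not exist. A result of `0` suggests the repository exists but the Turtle file has not been loaded yet.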
Validation assets:
- `validation/rqtl_shapes.ttl`
- `validation/schema_validation.ttl`
Validation execution is shown in src/validation.ipynb using pyshacl.validate(...).
Typical outputs:
- `output/validation_report.ttl`
- `output/validation_text.txt`
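A minimal sketch of such a validation run follows. It is not copied from `src/validation.ipynb`: the function name `run_validation` and its parameters are hypothetical, and the notebook may pass additional options to `pyshacl.validate(...)`. The sketch relies only on the documented return value of `validate`, a tuple of conformance flag, report graph, and report text.

```python
from pathlib import Path


def run_validation(data_path, shapes_path, report_ttl, report_txt):
    """Validate a Turtle graph against SHACL shapes and write both reports."""
    # Imported lazily so this module loads even without the optional packages.
    from rdflib import Graph
    from pyshacl import validate

    data_graph = Graph().parse(str(data_path), format="turtle")
    shapes_graph = Graph().parse(str(shapes_path), format="turtle")

    conforms, report_graph, report_text = validate(
        data_graph, shacl_graph=shapes_graph
    )

    # Make sure the output folders exist before writing the reports.
    Path(report_ttl).parent.mkdir(parents=True, exist_ok=True)
    Path(report_txt).parent.mkdir(parents=True, exist_ok=True)
    report_graph.serialize(destination=str(report_ttl), format="turtle")
    Path(report_txt).write_text(report_text, encoding="utf-8")
    return conforms
```

The function returns `True` when the graph conforms to all shapes. Violations are listed in both the Turtle report and the text report.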
analysis/rqtl_ruleset.pie contains rule templates for deriving additional relations, including:
- putative SNP-to-gene implication
- ratio-to-pathway inference
- putative disease implication chains
- candidate therapeutic target links
This file is intended for rule-engine workflows (for example, GraphDB rule sets).
src/predict_drugs.py:
- Loads structural triples from Turtle.
- Trains a TransE embedding model with PyKEEN.
- Scores disease-drug proximity in embedding space.
- Writes predictions to `output/drug_repurposing_predictions.csv`.
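To illustrate what scoring in embedding space can look like, the sketch below applies the standard TransE scoring function, the negative distance between `head + relation` and `tail`, to toy vectors. It is not the implementation in `src/predict_drugs.py`: the relation name `treats`, the drug names, and all vector values are made up for illustration, and the script may measure proximity differently.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


def rank_drugs(disease_vec, relation_vec, drug_vecs):
    """Rank drugs by how well (drug, relation, disease) fits, best first."""
    scored = [
        (name, transe_score(vec, relation_vec, disease_vec))
        for name, vec in drug_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 2-dimensional embeddings (hypothetical values, not trained output).
treats = [1.0, 0.0]
disease = [1.0, 1.0]
drugs = {
    "drug_a": [0.0, 1.0],
    "drug_b": [0.0, 0.0],
    "drug_c": [3.0, 1.0],
}

ranking = rank_drugs(disease, treats, drugs)
```

Scores closer to zero indicate a more plausible triple, so here `drug_a` ranks first and `drug_c` last.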
- Several integration scripts call external APIs; internet access and rate limits matter.
- Some outputs and logs are large; avoid committing regenerated artifacts unintentionally.
- Relative paths in notebooks usually assume that code is executed from the `src/` directory.
- If you run outside notebooks, ensure your working directory keeps file paths consistent.
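One way to make paths independent of the working directory is to locate the repository root first and build every path from it. The sketch below assumes the root can be recognized by the `requirements.txt` file mentioned above. `find_repo_root` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "requirements.txt") -> Path:
    """Walk upward from start until a directory containing marker is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {start}")
```

For example, `find_repo_root(Path.cwd()) / "output" / "mtrKG.ttl"` resolves to the same file whether the code is started from `src/` or from the repository root.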