mtrKG

mtrKG is a Metabolite Ratio Knowledge Graph project. It integrates local rQTL outputs with public resources (GWAS Catalog, Open Targets, STRING, Ensembl Regulatory Build, Reactome, HMDB, Rhea) and supports SPARQL analysis, SHACL validation, and drug repurposing experiments on the resulting graph.

What this repository contains

A schema and integration pipeline that builds RDF/Turtle knowledge graphs.
Analysis queries (.rq) and notebook workflows for hypothesis-driven exploration.
Validation shapes and reports (PySHACL).
Optional ML-based link prediction for drug repurposing (PyKEEN/TransE).

End-to-end workflow (high level)

Ingest local rQTL JSON files into an RDF graph.
Enrich the graph with external biomedical sources.
Serialize the graph to Turtle (output/mtrKG_01.ttl / output/mtrKG.ttl).
Load the graph into GraphDB (repository usually named mtrKG).
Run SPARQL analysis from analysis/*.rq.
Save query outputs as CSV files for downstream reporting.
Validate graph quality with SHACL constraints.

Repository map

Path	Purpose	Key contents
`src/`	Core code and notebooks	Integrators, schema, SPARQL engines, build/analysis/validation notebooks
`analysis/`	SPARQL and rule assets	`002_query.rq` ... `014_query.rq`, `q10_query_construct.rq`, `rqtl_ruleset.pie`, generated `*.csv`
`data/`	Source datasets and DB artifacts	`json_files/` (rQTL JSONs), GWAS/HMDB exports, SQLite assets
`output/`	Generated outputs	KG Turtle files, integration logs/reports, drug predictions, validation outputs
`validation/`	SHACL shapes and reports	`rqtl_shapes.ttl`, `schema_validation.ttl`, validation reports
`doc/`	Documentation assets	Drawings, pathway visuals, exported HTML/PDF diagrams
`notebook/`	Auxiliary notebook area	Additional/legacy notebook content
`sandbox/`	Experimental work area	Scratch notebooks, prototypes, intermediate artifacts

Main notebooks

Notebook	Purpose
`src/create_mtrKG.ipynb`	Main KG build and enrichment pipeline
`src/analyse_graph.ipynb`	Runs SPARQL analyses over GraphDB and exports query outputs
`src/validation.ipynb`	Runs PySHACL validation and writes reports
`src/utility.ipynb`	Utility analysis/visualization snippets
`src/notebook.ipynb`, `src/notebook_01.ipynb`	Broader exploratory/legacy workflows

Main Python modules

Module	Responsibility
`src/schema_definition.py`	Declares namespaces, classes, and predicates; builds ontology/schema scaffold
`src/integrate_rQTLs.py`	Ingests local rQTL JSON and creates core ratio/variant/causal structures
`src/integrate_gwas_catalog.py`	Enriches SNPs with GWAS Catalog associations
`src/integrate_open_targets.py`	Adds target tractability, diseases, liabilities, and known drugs
`src/integrate_string.py`	Adds gene-gene interaction edges from STRING
`src/integrate_encode.py`	Adds SNP overlap with Ensembl regulatory/motif features
`src/integrate_reactome.py`	Adds gene/metabolite pathway context from Reactome
`src/integrate_HMDB.py`	Adds metabolite location annotations from HMDB
`src/integrate_rhea.py`	Adds reaction participation via Rhea SPARQL
`src/integrate_ewas.py`	Optional EWAS enrichment module
`src/graphdb_engine.py`	Executes SPARQL against GraphDB and returns DataFrames
`src/execute_sparql.py`	Executes SPARQL directly on local RDFLib graph
`src/predict_drugs.py`	Trains TransE embeddings and generates drug repurposing candidates

Data assets

data/json_files/: local rQTL JSON inputs (5,095 files in this workspace snapshot).
data/hmdb_metabolites.xml, data/hmdb_proteins.xml: HMDB bulk XML resources.
data/gwas-catalog-download-associations-alt-full.tsv: GWAS association table.
data/Human-GEM.xml: metabolic model resource.
data/instance/metabolite-ratio-app.sqlite: local app database asset.
data/populate_db_with_jsons.py: helper script for filling the SQLite app schema.

Setup

The repository includes a root requirements.txt that centralizes Python dependencies used by the main scripts and notebooks.

Included package groups:

Core KG build/query stack: rdflib, pandas, requests, SPARQLWrapper, urllib3
Validation workflow: pyshacl
Drug repurposing workflow: numpy, scipy, torch, pykeen
Notebook/visualization utilities: tqdm, networkx, matplotlib, pyvis, qrcode[pil]

Install with:

pip install -r requirements.txt

Building the KG

Recommended entry point: src/create_mtrKG.ipynb.

The notebook pipeline applies integrations in this sequence:

rQTL local JSON ingestion
GWAS Catalog
Open Targets
STRING
Ensembl Regulatory Build
Reactome (genes, then metabolites)
HMDB
Rhea
Serialize graph to Turtle
Optional drug repurposing

Typical output files are written under output/ and output/integration/:

mtrKG.ttl / mtrKG_01.ttl
*_integration.log
*_mapping_report.csv
drug_repurposing_predictions.csv

SPARQL analysis workflow

Primary analysis notebook: src/analyse_graph.ipynb.

Queries are stored in analysis/*.rq and executed via query_graphdb() in src/graphdb_engine.py.

Current behavior:

Reads each .rq file.
Prints the exact SPARQL query text being executed.
Executes the query on GraphDB (repo_name="mtrKG" by default).
Saves SELECT results as CSV with the same base name:
- analysis/002_query.rq -> analysis/002_query.csv
- analysis/013_query_count_triples.rq -> analysis/013_query_count_triples.csv
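
The loop described above can be sketched as follows. The real `query_graphdb()` signature is not shown in this README, so the sketch takes the executor as a parameter (`run_query`, a hypothetical callable returning a header and rows) rather than guessing it. Only the file handling is illustrated, and every query is treated as a SELECT.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Iterable

# (header, rows) is an assumed return shape, used for illustration only.
QueryResult = tuple[list[str], Iterable[list[str]]]


def run_all_queries(query_dir: Path, run_query: Callable[[str], QueryResult]) -> list[Path]:
    """Run every .rq file in query_dir and write a CSV with the same base name."""
    written = []
    for rq_path in sorted(query_dir.glob("*.rq")):
        query_text = rq_path.read_text(encoding="utf-8")
        print(f"--- {rq_path.name} ---\n{query_text}")  # echo the exact query
        header, rows = run_query(query_text)
        csv_path = rq_path.with_suffix(".csv")  # 002_query.rq -> 002_query.csv
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        written.append(csv_path)
    return written


# Demo in a temporary directory with a stub executor instead of GraphDB.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "002_query.rq").write_text(
        "SELECT * WHERE { ?s ?p ?o } LIMIT 1", encoding="utf-8"
    )
    outputs = run_all_queries(demo_dir, lambda q: (["s"], [["http://example.org/a"]]))
    output_names = [path.name for path in outputs]
    csv_text = outputs[0].read_text(encoding="utf-8")
```

Passing the executor in keeps the file logic testable without a running GraphDB instance.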

GraphDB prerequisites:

GraphDB is running locally.
Repository exists (usually mtrKG).
The generated Turtle graph is loaded into that repository.
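
For orientation, a query against a local GraphDB repository can be sent over the standard SPARQL protocol. The sketch below is a stand-in for what `src/graphdb_engine.py` does, not a copy of it: it assumes GraphDB's default port 7200 and the RDF4J-style endpoint `/repositories/<repo>`, uses only the standard library rather than SPARQLWrapper, and all helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_select_request(query, repo_name="mtrKG", base_url="http://localhost:7200"):
    """Build a SPARQL-protocol POST request (assumed GraphDB endpoint layout)."""
    endpoint = f"{base_url}/repositories/{urllib.parse.quote(repo_name)}"
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def bindings_to_rows(results: dict) -> list[dict]:
    """Flatten SPARQL JSON results into one plain dict per solution."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def select(query: str, **kwargs) -> list[dict]:
    """Send the query; requires GraphDB to be running with the repository loaded."""
    request = build_select_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return bindings_to_rows(json.load(response))


# Building the request does not touch the network; calling select() does.
request = build_select_request("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")
```

If the prerequisites above are not met, `select(...)` fails with a connection or HTTP error, which is a quick way to check the setup.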

Validation workflow

Validation assets:

validation/rqtl_shapes.ttl
validation/schema_validation.ttl

src/validation.ipynb runs the validation with pyshacl.validate(...).

Typical outputs:

output/validation_report.ttl
output/validation_text.txt

Rule-based reasoning asset

analysis/rqtl_ruleset.pie contains rule templates for deriving additional relations, including:

putative SNP-to-gene implication
ratio-to-pathway inference
putative disease implication chains
candidate therapeutic target links

This file is intended for rule-engine workflows (for example, GraphDB rule sets).
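
To make the idea concrete without reproducing the `.pie` syntax, here is one such rule (ratio-to-pathway inference) written as plain Python over triples. The predicate names `hasMetabolite`, `participatesIn`, and `implicatesPathway` are invented for this sketch and are not taken from `rqtl_ruleset.pie`. In practice a rule engine would apply the rule inside the triplestore.

```python
Triple = tuple[str, str, str]


def infer_ratio_pathways(triples: set[Triple]) -> set[Triple]:
    """If a ratio has metabolite M and M participates in pathway P,
    derive (ratio, implicatesPathway, P). Returns only the new triples."""
    metabolites_of: dict[str, set[str]] = {}
    pathways_of: dict[str, set[str]] = {}
    for subject, predicate, obj in triples:
        if predicate == "hasMetabolite":
            metabolites_of.setdefault(subject, set()).add(obj)
        elif predicate == "participatesIn":
            pathways_of.setdefault(subject, set()).add(obj)

    inferred = set()
    for ratio, metabolites in metabolites_of.items():
        for metabolite in metabolites:
            for pathway in pathways_of.get(metabolite, ()):
                inferred.add((ratio, "implicatesPathway", pathway))
    return inferred - triples


# Made-up facts for illustration.
facts = {
    ("ratio1", "hasMetabolite", "metA"),
    ("ratio1", "hasMetabolite", "metB"),
    ("metA", "participatesIn", "pathwayA"),
}
new_triples = infer_ratio_pathways(facts)
```

Derived links of this kind are putative by construction, which is why the rule list above labels them as such.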

Drug repurposing workflow

src/predict_drugs.py:

Loads structural triples from Turtle.
Trains a TransE embedding model with PyKEEN.
Scores disease-drug proximity in embedding space.
Writes predictions to output/drug_repurposing_predictions.csv.
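
For intuition, TransE models a triple (h, r, t) as a translation h + r ≈ t, so a candidate link is scored by the distance between h + r and t. The sketch below applies that formula to made-up 2-dimensional vectors using only the standard library. The entity names, the `treats` relation, and the choice of drug as head and disease as tail are assumptions. It does not use PyKEEN and does not reflect how `src/predict_drugs.py` computes proximity.

```python
import math


def transe_score(head, relation, tail):
    """Negative L2 distance ||h + r - t||; values closer to 0 are more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy embeddings; real ones would be learned from the graph's triples.
entities = {
    "drugA": [0.0, 0.0],
    "drugB": [1.0, 1.0],
    "diseaseX": [1.0, 0.0],
}
treats = [1.0, 0.0]  # hypothetical relation vector

# Rank candidate drugs for diseaseX, best score first.
ranked = sorted(
    (name for name in entities if name.startswith("drug")),
    key=lambda drug: transe_score(entities[drug], treats, entities["diseaseX"]),
    reverse=True,
)
```

Here `drugA` ranks first because `drugA + treats` lands exactly on `diseaseX`, while `drugB` is a distance of √2 away.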

Notes and practical considerations

Several integration scripts call external APIs, so a full build needs internet access and can be slowed or interrupted by rate limits.
Some outputs and logs are large; avoid committing regenerated artifacts unintentionally.
Relative paths in the notebooks usually assume that code is executed from the src/ directory.
If you run scripts outside the notebooks, set the working directory so that these relative paths still resolve.
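
One way to make paths independent of the working directory is to resolve them from a known anchor file. The sketch below assumes a script located in `src/` and the layout from the repository map. The helper name `repo_path` and the `/work` checkout location are hypothetical.

```python
from pathlib import Path


def repo_path(script_file: str, *parts: str) -> Path:
    """Resolve a path relative to the repository root, given a file inside src/."""
    repo_root = Path(script_file).resolve().parent.parent  # src/ -> repo root
    return repo_root.joinpath(*parts)


# Inside a real script this would be repo_path(__file__, "output", "mtrKG.ttl").
example = repo_path("/work/mtrKG/src/predict_drugs.py", "output", "mtrKG.ttl")
```

With this pattern the same script finds `output/mtrKG.ttl` whether it is launched from `src/`, from the repository root, or from elsewhere.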
