This repository contains a BioCypher adapter for Open Targets data version 25.12. The project is currently under active development.
AI Usage Disclaimer: This project makes extensive use of AI. The author guarantees that every effort has been made to oversee the architecture and coding style and review the generated content for quality and consistency.
- Overview
- Prerequisites
- Installation
- Usage
- Reference Knowledge Graph
- Open Targets Data Schema
- Code Generation
- Contributing
- License
BioCypher's modular design enables the use of different adapters to consume various data sources and produce knowledge graphs. This adapter serves as a "secondary adapter" for Open Targets data, meaning it adapts a pre-harmonised composite of atomic resources via the Open Targets pipeline.
The adapter includes a comprehensive reference knowledge graph with predefined
sets of node types (entities) and edge types (relationships), or in the language
of this adapter, presets of node and edge definitions. A script is provided
to run BioCypher with the adapter, creating a knowledge graph with all predefined
nodes and edges. On a consumer laptop, building the full graph typically takes
1-2 hours.
Key Features:
- Includes a comprehensive knowledge graph definition designed to cover all data provided by the Open Targets Platform. See Reference Knowledge Graph for details.
- Declarative syntax for graph schema construction
- Powered by duckdb for fast and memory-efficient processing
- True streaming from datasets to BioCypher with minimal intermediate memory usage
- Type-safe schema representation with Python classes
- uv for dependency management
- Clone the repository:
  git clone https://github.com/biocypher/open-targets.git
  cd open-targets
- Install dependencies:
  uv sync
- Activate the virtual environment (optional):
  source .venv/bin/activate    # On Unix/macOS
  .venv\Scripts\activate       # On Windows
  Or run commands directly with uv run:
  uv run python <script>
Runnable examples are provided in the example/ directory. Each example includes:
- A Python script demonstrating usage
- Configuration files
- Data preparation instructions in datasets/README.md
Important: When running the example scripts, ensure your current working directory is the project root.
- Follow the Installation steps
- Navigate to an example directory:
  cd example/full_graph
- Follow the data preparation instructions in datasets/README.md
- Important: Ensure your current working directory is the project root when running example scripts:
  cd /path/to/open-targets    # Navigate to project root
  uv run python example/full_graph/full_graph.py
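The working-directory requirement above can be checked before a script starts. The following is a minimal sketch, not part of the adapter: the helper names `is_project_root` and `require_project_root` are hypothetical, and treating the presence of the `example/` and `open_targets/` directories as the marker of the project root is an assumption made for illustration.

```python
from pathlib import Path

# Directories this README places at the project root. Using their presence
# as the root marker is an assumption made for this sketch.
ROOT_MARKERS = ("example", "open_targets")


def is_project_root(path: Path) -> bool:
    """Return True if every marker directory exists directly under path."""
    return all((path / name).is_dir() for name in ROOT_MARKERS)


def require_project_root(path: Path) -> None:
    """Raise RuntimeError with a readable message if path is not the root."""
    if not is_project_root(path):
        raise RuntimeError(f"Run this script from the project root, not {path}")
```

Calling `require_project_root(Path.cwd())` at the top of a custom script would fail early with a clear message instead of a missing-file error later in the run.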
- Full Graph (example/full_graph/): Builds the complete reference knowledge graph using all predefined definitions
- Custom Subset (example/custom_subset/): Demonstrates selecting specific node/edge definitions
See example/README.md for details on all available examples and how to create your own.
The reference knowledge graph includes 40+ node types and 50+ edge types. Each definition file contains detailed docstrings explaining what it does, what data it uses, and how it works.
- Node definitions: open_targets/definition/reference_kg/node/
- Edge definitions: open_targets/definition/reference_kg/edge/
- Complete list: open_targets/definition/reference_kg/kg.py
Each definition file (e.g., node_target.py, edge_molecule_has_adverse_reaction_adverse_reaction.py) starts with a module-level docstring that describes:
- What entities/relationships the definition creates
- Which Open Targets datasets it uses
- How the data is transformed
- What properties are included
Example:
# In open_targets/definition/reference_kg/node/node_target.py
"""Summary: Ensembl target gene nodes (symbol/name/biotype/functions).
Definition for TARGET nodes: scans the Targets parquet to emit Ensembl gene
targets with symbol, name, biotype, and function descriptions as the core
target entities used across drug, association, and annotation edges in the KG.
"""
To explore available definitions:
- Browse the files in open_targets/definition/reference_kg/node/ and edge/
- Read the docstring at the top of each file
- Check kg.py to see how definitions are organized
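Browsing the docstrings can also be done programmatically. The sketch below uses only the standard library to pull the first line of each module-level docstring without importing the definition modules. The function names `docstring_summary` and `list_definitions` are hypothetical helpers introduced here, not part of the adapter.

```python
import ast
from pathlib import Path


def docstring_summary(source: str) -> str:
    """Return the first line of a module-level docstring, or '' if absent."""
    doc = ast.get_docstring(ast.parse(source))
    return doc.splitlines()[0] if doc else ""


def list_definitions(directory: Path) -> dict[str, str]:
    """Map each .py file name in directory to its docstring summary."""
    return {
        path.name: docstring_summary(path.read_text(encoding="utf-8"))
        for path in sorted(directory.glob("*.py"))
    }
```

From the project root, `list_definitions(Path("open_targets/definition/reference_kg/node"))` would return one summary line per node definition file, assuming each file follows the docstring convention described above.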
For details on creating custom definitions, see the adapter layer documentation in open_targets/adapter/ and examine existing definitions as examples.
The full schema of Open Targets data is represented as Python classes in open_targets/data/schema.py. This provides type checking for dataset and field references.
Naming Conventions:
- Dataset classes: Dataset prefix (e.g., DatasetTargets, DatasetDiseases)
- Field classes: Field prefix (e.g., FieldTargetsId, FieldTargetsApprovedSymbol)
- Field names follow their structural location in datasets
The schema is generated from Open Targets metadata using code generation (see below). Schema classes are used throughout node/edge definitions for type-safe data access.
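To make the naming conventions concrete, here is a small sketch that reproduces the example class names listed above from a dataset name and a field path. It is an illustration only: the helpers `dataset_class_name` and `field_class_name` are hypothetical, and the actual generator may treat nested elements, separators, or casing differently.

```python
def to_pascal(name: str) -> str:
    """Upper-case the first character and leave the rest (camelCase) intact."""
    return name[:1].upper() + name[1:]


def dataset_class_name(dataset: str) -> str:
    """Build a dataset class name, e.g. 'targets' -> 'DatasetTargets'."""
    return "Dataset" + to_pascal(dataset)


def field_class_name(dataset: str, path: list[str]) -> str:
    """Build a field class name from the dataset and the field's path in it.

    Assumption for this sketch: each path segment is appended in order,
    mirroring the field's structural location in the dataset.
    """
    return "Field" + to_pascal(dataset) + "".join(to_pascal(p) for p in path)
```

For instance, `field_class_name("targets", ["approvedSymbol"])` yields `FieldTargetsApprovedSymbol`, matching the example above; always confirm the real name in open_targets/data/schema.py.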
The Open Targets data schema is generated using Jinja templates.
- Templates: open_targets/*.jinja
- Generated code: open_targets/data/schema.py
- Generation script: code_generation/generate.py
Important: Never edit generated files directly. Always modify templates and regenerate:
python code_generation/generate.py
Contributions are welcome! Please feel free to submit a Pull Request or create an Issue if you discover any problems.
This project is licensed under the MIT License - see the LICENSE file for details.