biocypher/open-targets

BioCypher Open Targets Data (25.12) Adapter

This repository contains a BioCypher adapter for Open Targets data version 25.12. The project is currently under active development.

AI Usage Disclaimer: This project makes extensive use of AI. The author guarantees that every effort has been made to oversee the architecture and coding style and review the generated content for quality and consistency.

Overview

BioCypher's modular design lets different adapters consume different data sources and produce knowledge graphs. This adapter is a "secondary adapter" for Open Targets data: rather than reading the original atomic resources directly, it adapts the pre-harmonised composite of those resources that the Open Targets pipeline produces.

The adapter ships with a comprehensive reference knowledge graph: predefined sets of node types (entities) and edge types (relationships), referred to in this adapter as presets of node and edge definitions. A script is provided to run BioCypher with the adapter and build a knowledge graph containing all predefined nodes and edges. On a consumer laptop, building the full graph typically takes 1-2 hours.

Key Features:

  • Includes a comprehensive knowledge graph definition designed to cover all data provided by the Open Targets Platform. See Reference Knowledge Graph for details.
  • Declarative syntax for graph schema construction
  • Powered by DuckDB for fast and memory-efficient processing
  • True streaming from datasets to BioCypher with minimal intermediate memory usage
  • Type-safe schema representation with Python classes
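
The streaming idea in the list above can be sketched roughly as follows. This is only an illustration, not the adapter's code: it uses `sqlite3` from the standard library as a stand-in for DuckDB so that it runs without extra dependencies, and the table, column names, node label, and `stream_target_nodes` function are all hypothetical. Rows are fetched in small batches and yielded one at a time as `(id, label, properties)` tuples, so the full result set is never held in memory.

```python
import sqlite3
from collections.abc import Iterator

NodeTuple = tuple[str, str, dict[str, str]]


def stream_target_nodes(
    connection: sqlite3.Connection,
    batch_size: int = 2,
) -> Iterator[NodeTuple]:
    """Yield (id, label, properties) tuples lazily, one batch at a time."""
    cursor = connection.execute("SELECT id, approved_symbol FROM targets ORDER BY id")
    while True:
        rows = cursor.fetchmany(batch_size)  # at most `batch_size` rows in memory
        if not rows:
            break
        for target_id, symbol in rows:
            yield target_id, "target", {"approved_symbol": symbol}


def build_demo_connection() -> sqlite3.Connection:
    """Create a tiny in-memory table standing in for a targets dataset (hypothetical)."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE targets (id TEXT, approved_symbol TEXT)")
    connection.executemany(
        "INSERT INTO targets VALUES (?, ?)",
        [
            ("ENSG00000141510", "TP53"),
            ("ENSG00000012048", "BRCA1"),
            ("ENSG00000146648", "EGFR"),
        ],
    )
    return connection


if __name__ == "__main__":
    for node in stream_target_nodes(build_demo_connection()):
        print(node)
```

Because `stream_target_nodes` is a generator, a consumer can start writing nodes as soon as the first batch arrives instead of waiting for the whole query to finish.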

Prerequisites

  • uv for dependency management

Installation

  1. Clone the repository:

    git clone https://github.com/biocypher/open-targets.git
    cd open-targets
  2. Install dependencies:

    uv sync
  3. Activate the virtual environment (optional):

    source .venv/bin/activate  # On Unix/macOS
    .venv\Scripts\activate    # On Windows

    Or run commands directly with uv run:

    uv run python <script>

Usage

Runnable examples are provided in the example/ directory. Each example includes:

  • A Python script demonstrating usage
  • Configuration files
  • Data preparation instructions in datasets/README.md

Important

When running the example scripts, ensure your current working directory is the project root.
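
Since the example scripts expect to be run from the project root, a small guard like the one below can catch a wrong working directory early. It is a sketch, not part of the repository: `find_project_root` and `is_project_root` are hypothetical helpers, and the code assumes the project root is the directory that contains `pyproject.toml`.

```python
from __future__ import annotations

from pathlib import Path


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path | None:
    """Return the closest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None


def is_project_root(directory: Path, marker: str = "pyproject.toml") -> bool:
    """Check whether `directory` itself contains the marker file."""
    return find_project_root(directory, marker) == directory.resolve()


if __name__ == "__main__":
    cwd = Path.cwd()
    if is_project_root(cwd):
        print(f"OK: running from project root {cwd}")
    else:
        print(f"Warning: {cwd} is not the project root; found {find_project_root(cwd)}")
```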

Quick Start

  1. Follow the Installation steps
  2. Navigate to an example directory:
    cd example/full_graph
  3. Follow the data preparation instructions in datasets/README.md
  4. Return to the project root and run the example script from there (the scripts expect the project root as the current working directory):
    cd /path/to/open-targets  # Navigate to project root
    uv run python example/full_graph/full_graph.py

Available Examples

  • Full Graph (example/full_graph/): Builds the complete reference knowledge graph using all predefined definitions
  • Custom Subset (example/custom_subset/): Demonstrates selecting specific node/edge definitions

See example/README.md for details on all available examples and how to create your own.

Reference Knowledge Graph

The reference knowledge graph includes 40+ node types and 50+ edge types. Each definition file contains detailed docstrings explaining what it does, what data it uses, and how it works.

Finding Definition Files

  • Node definitions: open_targets/definition/reference_kg/node/
  • Edge definitions: open_targets/definition/reference_kg/edge/
  • Complete list: open_targets/definition/reference_kg/kg.py

Reading Docstrings

Each definition file (e.g., node_target.py, edge_molecule_has_adverse_reaction_adverse_reaction.py) starts with a module-level docstring that describes:

  • What entities/relationships the definition creates
  • Which Open Targets datasets it uses
  • How the data is transformed
  • What properties are included

Example:

# In open_targets/definition/reference_kg/node/node_target.py
"""Summary: Ensembl target gene nodes (symbol/name/biotype/functions).

Definition for TARGET nodes: scans the Targets parquet to emit Ensembl gene
targets with symbol, name, biotype, and function descriptions as the core
target entities used across drug, association, and annotation edges in the KG.
"""

To explore available definitions:

  1. Browse the files in open_targets/definition/reference_kg/node/ and edge/
  2. Read the docstring at the top of each file
  3. Check kg.py to see how definitions are organized
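
The browsing steps above can also be scripted. The sketch below uses the standard-library `ast` module to collect the first line of each module docstring in a directory without importing the modules. `summarize_definitions` is a hypothetical helper, not part of the adapter; the directory path is the one given in this section.

```python
import ast
from pathlib import Path


def summarize_definitions(directory: Path) -> dict[str, str]:
    """Map each Python file name in `directory` to the first line of its module docstring."""
    summaries: dict[str, str] = {}
    for path in sorted(directory.glob("*.py")):
        module = ast.parse(path.read_text(encoding="utf-8"))
        docstring = ast.get_docstring(module)  # None if the module has no docstring
        summaries[path.name] = docstring.splitlines()[0] if docstring else "(no docstring)"
    return summaries


if __name__ == "__main__":
    node_dir = Path("open_targets/definition/reference_kg/node")
    if node_dir.is_dir():
        for name, summary in summarize_definitions(node_dir).items():
            print(f"{name}: {summary}")
    else:
        print(f"{node_dir} not found; run this from the project root")
```

Parsing with `ast` instead of importing avoids executing the definition modules, so the listing works even if their dependencies are not installed.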

For details on creating custom definitions, see the adapter layer documentation in open_targets/adapter/ and examine existing definitions as examples.

Open Targets Data Schema

The full schema of Open Targets data is represented as Python classes in open_targets/data/schema.py. This provides type checking for dataset and field references.

Naming Conventions:

  • Dataset classes: Dataset prefix (e.g., DatasetTargets, DatasetDiseases)
  • Field classes: Field prefix (e.g., FieldTargetsId, FieldTargetsApprovedSymbol)
  • Field names follow their structural location in datasets

The schema is generated from Open Targets metadata using code generation (see below). Schema classes are used throughout node/edge definitions for type-safe data access.
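
To illustrate why class-based references help, here is a minimal sketch of what such a schema might look like. The class names follow the naming conventions above, but the class bodies, the `Dataset` and `Field` base classes, the `"targets"` identifier, and the `select_sql` helper are assumptions made for illustration only; the generated code in open_targets/data/schema.py may be structured quite differently.

```python
class Dataset:
    """Hypothetical base class: one subclass per Open Targets dataset."""

    id: str


class Field:
    """Hypothetical base class: one subclass per dataset field."""

    name: str
    dataset: type[Dataset]


class DatasetTargets(Dataset):
    id = "targets"  # assumed identifier, for illustration only


class FieldTargetsId(Field):
    name = "id"
    dataset = DatasetTargets


class FieldTargetsApprovedSymbol(Field):
    name = "approvedSymbol"
    dataset = DatasetTargets


def select_sql(*fields: type[Field]) -> str:
    """Build a SELECT over fields that must all belong to the same dataset."""
    if not fields:
        raise ValueError("at least one field is required")
    datasets = {field.dataset for field in fields}
    if len(datasets) != 1:
        raise ValueError("all fields must come from one dataset")
    columns = ", ".join(field.name for field in fields)
    return f"SELECT {columns} FROM {datasets.pop().id}"


if __name__ == "__main__":
    print(select_sql(FieldTargetsId, FieldTargetsApprovedSymbol))
```

With this style, a misspelled class name fails at import time or in a type checker, whereas a misspelled column string would only surface when the query runs.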

Code Generation

The Open Targets data schema is generated using Jinja templates.

  • Templates: open_targets/*.jinja
  • Generated code: open_targets/data/schema.py
  • Generation script: code_generation/generate.py

Important: Never edit generated files directly. Always modify templates and regenerate:

python code_generation/generate.py
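
As a rough illustration of template-driven generation, the sketch below renders class source code from a small list of field names. It is not the repository's generator: it uses `string.Template` from the standard library as a stand-in for Jinja so that it runs without extra dependencies, and the input shape, template text, and `render_field_classes` function are hypothetical.

```python
from string import Template

# Hypothetical template; the repository's real templates are Jinja files.
FIELD_CLASS_TEMPLATE = Template(
    'class Field${dataset}${field_class}:\n'
    '    """Field `${field_name}` of dataset `${dataset}`."""\n'
    '\n'
    '    name = "${field_name}"\n'
)


def to_pascal_case(name: str) -> str:
    """Convert a camelCase field name such as `approvedSymbol` to `ApprovedSymbol`."""
    return name[:1].upper() + name[1:]


def render_field_classes(dataset: str, field_names: list[str]) -> str:
    """Render one class definition per field and join them into a module body."""
    blocks = [
        FIELD_CLASS_TEMPLATE.substitute(
            dataset=dataset,
            field_class=to_pascal_case(field_name),
            field_name=field_name,
        )
        for field_name in field_names
    ]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(render_field_classes("Targets", ["id", "approvedSymbol"]))
```

Because the output is derived entirely from the template and its input, any manual change to a generated file would be lost on the next regeneration, which is the reason for editing templates instead.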

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or create an Issue if you discover any problems.

License

This project is licensed under the MIT License - see the LICENSE file for details.
