
OmicIDX

Computable, searchable index of millions of samples of public omics data.


Open, cloud-native indices of public genomics metadata — updated daily.

[Figure: OmicIDX architecture. Six data sources flow through a daily ETL pipeline to cloud Parquet files queryable from Python, R, DuckDB, or AI assistants.]

OmicIDX transforms raw NCBI and EBI metadata dumps into clean, analysis-ready Apache Parquet files served from a public CDN. Query millions of records directly from Python, R, DuckDB, or any Parquet-capable tool — no database server, no API keys, no downloads required.

Data Sources

| Source | Records | Description |
| --- | --- | --- |
| SRA | Studies, Samples, Experiments, Runs | NCBI Sequence Read Archive: the world's largest repository of raw sequencing data metadata |
| GEO | Series, Samples, Platforms | Gene Expression Omnibus: processed gene expression and functional genomics datasets |
| BioSample | BioSample records | Biological source material descriptions across all NCBI submissions |
| BioProject | BioProject records | Project-level organization linking studies, samples, and data |
| PubMed | Articles with full abstracts | Biomedical literature including titles, authors, MeSH terms, references, and complete abstract text |
| EBI BioSamples | BioSample records | European Bioinformatics Institute sample metadata with structured characteristics |

Quick Start

All data is publicly available as Parquet files — no authentication needed.

Python (DuckDB)

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Query SRA runs directly from the cloud
df = con.sql("""
    SELECT *
    FROM 'https://data-omicidx.cancerdatasci.org/sra/parquet/sra_runs.parquet'
    WHERE organism = 'Homo sapiens'
    LIMIT 100
""").df()

R (arrow)

library(arrow)

sra_studies <- read_parquet(
  "https://data-omicidx.cancerdatasci.org/sra/parquet/sra_studies.parquet"
)
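
The Python query above filters on a column named organism, so it can help to check a file's schema before writing filters. The following is a minimal sketch, not part of OmicIDX itself: the helper names describe_sql and show_schema are hypothetical, and it assumes DuckDB's DESCRIBE statement together with the httpfs extension already used in the Python example.

```python
def describe_sql(url: str) -> str:
    """Build a DuckDB statement that lists the columns of a Parquet file."""
    safe_url = url.replace("'", "''")  # escape single quotes for the SQL literal
    return f"DESCRIBE SELECT * FROM '{safe_url}'"


def show_schema(url: str) -> None:
    """Print column names and types; needs the duckdb package and network access."""
    import duckdb  # imported lazily so describe_sql works without duckdb installed

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    for row in con.sql(describe_sql(url)).fetchall():
        print(row[0], row[1])  # column name, column type


RUNS_URL = "https://data-omicidx.cancerdatasci.org/sra/parquet/sra_runs.parquet"
print(describe_sql(RUNS_URL))
```

Calling show_schema(RUNS_URL) would print one line per column, which makes it easy to confirm that a field exists before filtering on it.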

CLI (DuckDB)

duckdb -c "
  SELECT count(*) as total_runs
  FROM 'https://data-omicidx.cancerdatasci.org/sra/parquet/sra_runs.parquet'
"

Available Parquet Files

All files are served from https://data-omicidx.cancerdatasci.org/:

| File | Path |
| --- | --- |
| SRA Studies | sra/parquet/sra_studies.parquet |
| SRA Samples | sra/parquet/sra_samples.parquet |
| SRA Experiments | sra/parquet/sra_experiments.parquet |
| SRA Runs | sra/parquet/sra_runs.parquet |
| SRA Accessions | sra/parquet/sra_accessions.parquet |
| GEO Series | geo/parquet/geo_series.parquet |
| GEO Samples | geo/parquet/geo_samples.parquet |
| GEO Platforms | geo/parquet/geo_platforms.parquet |
| BioSamples | biosample/parquet/biosamples.parquet |
| BioProjects | bioproject/parquet/bioprojects.parquet |
| PubMed | pubmed/raw/pubmed*.parquet |
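
Since every file shares the same base URL, full URLs can be assembled from the relative paths listed above. This is a small sketch under stated assumptions: the short table keys and the parquet_url helper are hypothetical names chosen for illustration, and PubMed is left out because its path is a wildcard pattern rather than a single file.

```python
BASE_URL = "https://data-omicidx.cancerdatasci.org/"

# Relative paths copied from the listing above; the dictionary keys are
# hypothetical short names, not identifiers defined by OmicIDX.
PARQUET_PATHS = {
    "sra_studies": "sra/parquet/sra_studies.parquet",
    "sra_samples": "sra/parquet/sra_samples.parquet",
    "sra_experiments": "sra/parquet/sra_experiments.parquet",
    "sra_runs": "sra/parquet/sra_runs.parquet",
    "sra_accessions": "sra/parquet/sra_accessions.parquet",
    "geo_series": "geo/parquet/geo_series.parquet",
    "geo_samples": "geo/parquet/geo_samples.parquet",
    "geo_platforms": "geo/parquet/geo_platforms.parquet",
    "biosamples": "biosample/parquet/biosamples.parquet",
    "bioprojects": "bioproject/parquet/bioprojects.parquet",
}


def parquet_url(table: str) -> str:
    """Return the full URL for a known table, or raise ValueError."""
    try:
        return BASE_URL + PARQUET_PATHS[table]
    except KeyError:
        known = ", ".join(sorted(PARQUET_PATHS))
        raise ValueError(f"unknown table {table!r}; choose from: {known}") from None


print(parquet_url("geo_series"))
```

The returned string can be dropped into the FROM clause of a DuckDB query or passed to any other Parquet reader.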

Repositories

| Repo | Description |
| --- | --- |
| omicidx-etl | Automated ETL pipelines: GitHub Actions + systemd orchestration, daily updates with full visibility into pipeline status |
| omicidx-parsers | NCBI XML parsing library with Pydantic v2 models (PyPI: omicidx) |
| omicidx-mcp | MCP server: lets AI assistants (Claude, Cursor) query OmicIDX data via natural language SQL |

ETL Pipeline

The omicidx-etl repo orchestrates daily data updates with full transparency:

  • Daily GitHub Actions workflows for SRA, GEO, BioSample, BioProject, PubMed, and EBI BioSamples
  • Incremental updates — fetches only new/changed records where possible, deduplicates automatically
  • SQL transformation layer — DuckDB-based consolidation from raw partitioned files into final Parquet tables
  • Pipeline visibility — all workflow runs are public on GitHub Actions; check the Actions tab for current status
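
The incremental-update and deduplication idea from the list above can be illustrated in a few lines. This is a plain-Python sketch of the concept only: the actual pipeline is described as using DuckDB SQL, and the field names accession and updated, the function merge_incremental, and the sample records are all made up for illustration.

```python
from typing import Iterable

# Hypothetical record shape: {"accession": str, "updated": str, ...}
Record = dict


def merge_incremental(existing: Iterable[Record], batch: Iterable[Record]) -> list:
    """Merge a new batch into existing records, keeping the newest row per accession."""
    latest = {}
    for rec in list(existing) + list(batch):
        key = rec["accession"]
        # ISO 8601 date strings compare correctly as plain strings.
        if key not in latest or rec["updated"] >= latest[key]["updated"]:
            latest[key] = rec
    return sorted(latest.values(), key=lambda r: r["accession"])


# Made-up sample data: one revised record, one unchanged, one new.
existing = [
    {"accession": "SRR000001", "updated": "2024-01-01", "title": "old"},
    {"accession": "SRR000002", "updated": "2024-01-01", "title": "unchanged"},
]
batch = [
    {"accession": "SRR000001", "updated": "2024-02-01", "title": "revised"},
    {"accession": "SRR000003", "updated": "2024-02-01", "title": "new"},
]

merged = merge_incremental(existing, batch)
print([r["title"] for r in merged])  # ['revised', 'unchanged', 'new']
```

Keeping only the newest row per key means that re-fetching an unchanged record is harmless, which is what makes fetching only new or changed records safe to repeat daily.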

AI-Powered Queries

omicidx-mcp exposes the full OmicIDX dataset to AI coding assistants through the Model Context Protocol. Connect it to Claude, Cursor, or any MCP-compatible tool and query genomics metadata with natural language — no SQL expertise required.

Popular repositories

  1. omicidx-parsers (Public)

    Exposing public genomics data via computable and searchable metadata

    Python 13

  2. omicidx-etl (Public archive)

    ARCHIVED: moved to https://github.com/omicidx/omicidx

    Python 4

  3. umls_thesaurus (Public)

    Python

  4. .github (Public)

    Omicidx org repo
