Open, cloud-native indices of public genomics metadata — updated daily.
OmicIDX transforms raw NCBI and EBI metadata dumps into clean, analysis-ready Apache Parquet files served from a public CDN. Query millions of records directly from Python, R, DuckDB, or any Parquet-capable tool, with no database server, no API keys, and no manual downloads required.
| Source | Records | Description |
|---|---|---|
| SRA | Studies, Samples, Experiments, Runs | NCBI Sequence Read Archive: metadata for the world's largest repository of raw sequencing data |
| GEO | Series, Samples, Platforms | Gene Expression Omnibus — processed gene expression and functional genomics datasets |
| BioSample | BioSample records | Biological source material descriptions across all NCBI submissions |
| BioProject | BioProject records | Project-level organization linking studies, samples, and data |
| PubMed | Articles with full abstracts | Biomedical literature including titles, authors, MeSH terms, references, and complete abstract text |
| EBI BioSamples | BioSample records | European Bioinformatics Institute sample metadata with structured characteristics |
All data is publicly available as Parquet files — no authentication needed.
Python (DuckDB)

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Query SRA runs directly from the cloud
df = con.sql("""
    SELECT *
    FROM 'https://data-omicidx.cancerdatasci.org/sra/parquet/sra_runs.parquet'
    WHERE organism = 'Homo sapiens'
    LIMIT 100
""").df()
```

R (arrow)

```r
library(arrow)

sra_studies <- read_parquet(
  "https://data-omicidx.cancerdatasci.org/sra/parquet/sra_studies.parquet"
)
```

CLI (DuckDB)

```bash
duckdb -c "
  SELECT count(*) AS total_runs
  FROM 'https://data-omicidx.cancerdatasci.org/sra/parquet/sra_runs.parquet'
"
```

All files are served from https://data-omicidx.cancerdatasci.org/:
| File | Path |
|---|---|
| SRA Studies | sra/parquet/sra_studies.parquet |
| SRA Samples | sra/parquet/sra_samples.parquet |
| SRA Experiments | sra/parquet/sra_experiments.parquet |
| SRA Runs | sra/parquet/sra_runs.parquet |
| SRA Accessions | sra/parquet/sra_accessions.parquet |
| GEO Series | geo/parquet/geo_series.parquet |
| GEO Samples | geo/parquet/geo_samples.parquet |
| GEO Platforms | geo/parquet/geo_platforms.parquet |
| BioSamples | biosample/parquet/biosamples.parquet |
| BioProjects | bioproject/parquet/bioprojects.parquet |
| PubMed | pubmed/raw/pubmed*.parquet |
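
Each path in the table is appended to the base URL to give the full file URL. As a minimal sketch, the helper below builds that URL, and a second helper asks DuckDB for a remote file's column names so you can check them before writing a query. The function names `parquet_url` and `describe_remote` are hypothetical and not part of any OmicIDX package. The second helper assumes the `duckdb` Python package is installed and the network is reachable.

```python
from urllib.parse import urljoin

# Base URL as stated for all OmicIDX files
BASE_URL = "https://data-omicidx.cancerdatasci.org/"


def parquet_url(path: str) -> str:
    """Join a relative path from the file table onto the base URL."""
    return urljoin(BASE_URL, path.lstrip("/"))


def describe_remote(url: str) -> list[tuple[str, str]]:
    """Return (column_name, column_type) pairs for a remote Parquet file.

    Needs network access; duckdb is imported lazily so that parquet_url
    can be used without it.
    """
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    # DESCRIBE reports one row per column: name first, type second
    rows = con.execute(f"DESCRIBE SELECT * FROM '{url}'").fetchall()
    return [(row[0], row[1]) for row in rows]
```

Calling `describe_remote(parquet_url("sra/parquet/sra_studies.parquet"))` should list that file's columns. Checking the schema first lets you select only the columns you need instead of `SELECT *`, which reduces the amount of data read over HTTP.
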
| Repo | Description |
|---|---|
| omicidx-etl | Automated ETL pipelines — GitHub Actions + systemd orchestration, daily updates with full visibility into pipeline status |
| omicidx-parsers | NCBI XML parsing library with Pydantic v2 models (PyPI: omicidx) |
| omicidx-mcp | MCP server — lets AI assistants (Claude, Cursor) query OmicIDX data via natural language SQL |
The omicidx-etl repo orchestrates daily data updates with full transparency:
- Daily GitHub Actions workflows for SRA, GEO, BioSample, BioProject, PubMed, and EBI BioSamples
- Incremental updates — fetches only new/changed records where possible, deduplicates automatically
- SQL transformation layer — DuckDB-based consolidation from raw partitioned files into final Parquet tables
- Pipeline visibility — all workflow runs are public on GitHub Actions; check the Actions tab for current status
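
The deduplication step of an incremental update can be sketched as follows. This illustrates the idea only and is not the code in omicidx-etl: the field names `accession` and `last_update` and the sample records are assumptions, and the pipeline itself is described as consolidating with DuckDB SQL rather than plain Python. The sketch keeps the most recent version of each record when the same accession appears in more than one batch.

```python
from typing import Iterable


def deduplicate(records: Iterable[dict], key: str = "accession",
                version: str = "last_update") -> list[dict]:
    """Keep the newest record per key, comparing the version field."""
    latest: dict[str, dict] = {}
    for rec in records:
        current = latest.get(rec[key])
        # ISO-8601 date strings compare correctly as plain strings
        if current is None or rec[version] > current[version]:
            latest[rec[key]] = rec
    # Sort by key so the output order is deterministic
    return sorted(latest.values(), key=lambda r: r[key])


# Two hypothetical incremental batches with one overlapping accession
batch_1 = [
    {"accession": "EXAMPLE_1", "last_update": "2024-01-01", "title": "old"},
    {"accession": "EXAMPLE_2", "last_update": "2024-01-01", "title": "only"},
]
batch_2 = [
    {"accession": "EXAMPLE_1", "last_update": "2024-02-01", "title": "new"},
]

consolidated = deduplicate(batch_1 + batch_2)
```

In DuckDB the same idea is commonly written with a window function, for example `QUALIFY row_number() OVER (PARTITION BY accession ORDER BY last_update DESC) = 1` applied to the raw partitioned files. Whether omicidx-etl uses this exact form is not stated here.
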
omicidx-mcp exposes the full OmicIDX dataset to AI coding assistants through the Model Context Protocol. Connect it to Claude, Cursor, or any MCP-compatible tool and query genomics metadata with natural language — no SQL expertise required.