Health Records Collection


Tools for unifying personal electronic health record (EHR) exports into a local SQLite database and exploring them with a Streamlit dashboard. The repository contains no protected health information; the ingest pipeline expects you to provide your own CCD exports. Portions of the scaffolding were drafted with generative AI and reviewed by human maintainers; see the full AI disclosure for details.


Quick Start

Requirements

  • Python 3.12 or newer
  • SQLite (bundled with Python)
  • Streamlit-compatible browser (Chrome, Edge, Firefox, Safari)

Setup

git clone <repo-url>
cd Health-Records-Collection

python -m venv .venv
.venv\Scripts\Activate.ps1   # Windows PowerShell
# or: source .venv/bin/activate   # macOS/Linux

pip install --upgrade pip
pip install -r requirements.txt

Ingest and Explore

  1. Drop each CCD ZIP export into data/raw/.

  2. Run the ingestion workflow:

    python ingest.py

    This creates or refreshes db/health_records.db, extracts ZIP contents into data/parsed/, and populates all supported tables.

    • Add --log-level debug to surface detailed troubleshooting messages while you iterate:

      python ingest.py --log-level debug
    • To capture logs without printing patient identifiers to the console, direct output to a file:

      python ingest.py --log-level info --log-file logs/ingest.log

      Debug logs include richer context, so avoid enabling them on shared systems.

  3. Launch the dashboard:

    streamlit run frontend/app.py

    Streamlit opens at http://localhost:8501 with an encounter overview, table browser, and SQL scratchpad.


How It Works

  • Ingestion pipeline (ingest.py)

    • Unzips CCD packages from data/raw/ into data/parsed/ (skipping extracts that already exist).
    • Parses XML with lxml using modular parsers in parsers/ for patients, encounters, allergies, conditions, medications, labs, procedures, vitals, immunizations, progress notes, and insurance coverage.
    • Records file-level provenance in the data_source table (original filename, archive, SHA256 hash, creation time, repository ID, and author institution pulled from XDM METADATA.XML) and threads the resulting identifier through every downstream insert (see the provenance sketch after this list).
    • Normalizes providers, deduplicates medications and immunizations, and invokes service modules in services/ to load data into SQLite.
    • Applies schema migrations on the fly via db/schema.py to keep older databases compatible.
  • Streamlit dashboard (frontend/)

    • views.py renders an Encounter Overview with expandable visit summaries, including diagnoses and medications.
    • Sidebar controls let you pick tables to preview using reusable widgets in ui_components.py.
    • A SQL query box allows ad-hoc exploration; results render with native Streamlit dataframes.
    • Connection utilities in db_utils.py keep the UI responsive with row limits and read-only access (see the query-helper sketch after this list).
    • XML files are rendered using the HL7 CDA Core Stylesheet, automatically updated weekly from the official repository with proper attribution.
  • Schema & services (schema.sql, services/)

    • schema.sql defines core tables for patients, providers, encounters, medications, lab results, allergies, insurance coverage, conditions (with codes), procedures, vitals, immunizations, attachments, and progress notes, plus the ingested_archive registry that tracks archive hashes and ingestion counts. Each table links back to enriched data_source metadata, which now carries source_archive_id foreign keys to ingested_archive.
    • Service modules encapsulate insert logic, deduplication, and foreign key wiring for each domain. services/data_sources.py manages provenance rows so other modules can reference a shared data_source_id, while services/archives.py records archive hashes so duplicate uploads can be flagged safely.
    • db/schema.py backfills missing columns, normalizes provider records, and adds protective indexes (see the column-backfill sketch after this list).
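
For illustration, here is a minimal sketch of the provenance and parsing steps described above. The table columns, function names, and XPath expressions are assumptions for the sketch; the actual logic lives in ingest.py, parsers/, and services/data_sources.py.

    import hashlib
    import sqlite3
    from pathlib import Path

    from lxml import etree

    CDA_NS = {"hl7": "urn:hl7-org:v3"}  # CCD/CDA documents use the HL7 v3 namespace

    def archive_sha256(path: Path) -> str:
        """Hash an export archive so duplicate uploads can be detected."""
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def record_data_source(conn: sqlite3.Connection, xml_path: Path, archive: Path) -> int:
        """Insert a provenance row and return its id for downstream inserts.
        Column names here are illustrative, not the real schema."""
        cur = conn.execute(
            "INSERT INTO data_source (file_name, archive_name, sha256) VALUES (?, ?, ?)",
            (xml_path.name, archive.name, archive_sha256(archive)),
        )
        return cur.lastrowid

    def patient_name(xml_path: Path) -> str:
        """Pull the patient's name out of a CCD with lxml, as the parsers/ modules do."""
        tree = etree.parse(str(xml_path))
        given = tree.findtext(".//hl7:patient/hl7:name/hl7:given", namespaces=CDA_NS)
        family = tree.findtext(".//hl7:patient/hl7:name/hl7:family", namespaces=CDA_NS)
        return f"{given} {family}"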
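
In the same spirit, the dashboard's read-only, row-limited access pattern might look roughly like the helper below. The function name and default limit are assumptions; the real utilities live in frontend/db_utils.py.

    import sqlite3

    import pandas as pd

    def read_only_query(db_path: str, sql: str, max_rows: int = 500) -> pd.DataFrame:
        """Run an ad-hoc query without allowing writes.
        Opening the file with mode=ro makes SQLite reject INSERT/UPDATE/DELETE,
        and the row cap keeps the Streamlit UI responsive on large tables."""
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            cur = conn.execute(sql)
            rows = cur.fetchmany(max_rows)
            columns = [col[0] for col in cur.description]
        finally:
            conn.close()
        return pd.DataFrame(rows, columns=columns)

    # In the app, the result can be rendered directly with st.dataframe(...).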
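
And the on-the-fly migrations in db/schema.py amount to column backfills on older databases. A sketch of that idea, with a hypothetical column definition as the example:

    import sqlite3

    def ensure_column(conn: sqlite3.Connection, table: str, column: str, ddl: str) -> None:
        """Add a column to an existing table if an older database is missing it."""
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        if column not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")

    # Illustrative only -- the real migration list is defined in db/schema.py:
    # ensure_column(conn, "data_source", "source_archive_id",
    #               "INTEGER REFERENCES ingested_archive(id)")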

Configuration

Use the Settings view in the Streamlit sidebar to update the raw, parsed, and database paths. Overrides are saved to user/settings.yaml and the app automatically reloads after changes.
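
As a rough illustration, the overrides can be read back with PyYAML along these lines; the keys shown are assumptions, since the Settings view writes whatever format the app expects.

    from pathlib import Path

    import yaml  # PyYAML

    # Hypothetical shape of user/settings.yaml:
    #   raw_dir: data/raw
    #   parsed_dir: data/parsed
    #   db_path: db/health_records.db

    settings_file = Path("user/settings.yaml")
    overrides = yaml.safe_load(settings_file.read_text()) if settings_file.exists() else None
    db_path = (overrides or {}).get("db_path", "db/health_records.db")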

External Resources

  • CDA Rendering

    • This project uses the HL7 CDA Core Stylesheet for rendering CDA XML documents, which is maintained in a separate repository and automatically updated via GitHub Actions. The stylesheet files are included under the Apache 2.0 license with proper attribution.
  • Color Palette: Coolors.co


Repository Layout

data/               Raw ZIP exports (`raw/`) and extracted XML (`parsed/`)
db/                 SQLite artifacts (`health_records.db`) and schema helpers
frontend/           Streamlit application entry point, views, and utilities
parsers/            CCD XML parsers grouped by domain
services/           Persistence helpers for each domain table
tests/              Pytest suite covering parsers, services, schema, and ingest flow
ingest.py           Command-line ingestion workflow
schema.sql          Canonical database definition
requirements.txt    Locked Python dependencies

Configuration & Customization

  • Update frontend/config.yaml to change the dashboard title, layout, database path, or default row limits.
  • Extend parsing coverage by adding new modules in parsers/ and wiring them into ingest.py (sketched below).
  • Modify or append tables by editing schema.sql and enhancing db/schema.py to enforce migrations.
  • Regenerate the database at any time by deleting db/health_records.db and rerunning python ingest.py.
  • Control ingestion verbosity per run with --log-level {error,warning,info,debug} and optionally persist output via --log-file path/to/logs.txt.
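
To make the parser extension point above concrete, a new domain module might follow the same shape as the existing ones. Everything below (module name, function signature, LOINC filter) is an assumption for the sketch, not the project's actual interface.

    # parsers/smoking_status.py -- hypothetical new domain parser
    from lxml import etree

    CDA_NS = {"hl7": "urn:hl7-org:v3"}

    def parse_smoking_status(xml_path: str) -> list[dict]:
        """Extract tobacco smoking status observations from a CCD document."""
        tree = etree.parse(xml_path)
        rows = []
        for obs in tree.findall(".//hl7:observation", namespaces=CDA_NS):
            code = obs.find("hl7:code", namespaces=CDA_NS)
            if code is None or code.get("code") != "72166-2":  # LOINC: Tobacco smoking status
                continue
            value = obs.find("hl7:value", namespaces=CDA_NS)
            rows.append({
                "code": code.get("code"),
                "status": value.get("displayName") if value is not None else None,
            })
        return rows

ingest.py would then call the new parser alongside the existing ones and hand its rows to a matching service module in services/ for insertion.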

Development

  • Run the automated tests with:

    pytest
  • The project targets Python 3.12; please keep new dependencies pinned in requirements.txt.

  • Follow the contributor guidelines in CONTRIBUTING.md and report security concerns per SECURITY.md.


License

MIT License. See LICENSE for full terms.
