SaeMind/icd10-phenotype

ICD-10 Diagnosis Code Embedding and Phenotype Clustering

Author: Andrew Lee | UTHealth Houston SBMI
Stack: Python (ClinicalBERT → UMAP → HDBSCAN → Plotly)
Dataset: MIMIC-IV (PhysioNet credentialed access)
Status: Reproducible pipeline — v1.0


Project Overview

This project generates dense vector representations of ICD-10 diagnosis codes using a pretrained clinical language model (ClinicalBERT / Bio_ClinicalBERT) and applies unsupervised clustering to discover latent patient phenotypes from admission diagnosis patterns.

The pipeline addresses one question: do ICD-10 code embeddings, once reduced and clustered, recover clinically coherent disease phenotypes without supervision? The answer bears on cohort discovery, disease progression modeling, and computational phenotyping at scale.

Research Question

Can ClinicalBERT-derived embeddings of ICD-10 code descriptions, combined with MIMIC-IV admission diagnosis sequences, produce clinically coherent phenotype clusters — and do these clusters predict 30-day readmission differentially?


Repository Structure

icd10-phenotype/
├── src/
│   ├── embeddings/
│   │   ├── icd10_description_embedder.py  # ClinicalBERT → per-code embeddings
│   │   ├── patient_embedding_aggregator.py # Per-admission embedding aggregation
│   │   └── embedding_cache.py              # Disk-based embedding cache
│   ├── clustering/
│   │   ├── dimensionality_reduction.py     # UMAP (2D viz + 50D clustering)
│   │   ├── hdbscan_clustering.py           # HDBSCAN phenotype discovery
│   │   └── cluster_profiler.py             # Cluster characterization
│   └── visualization/
│       ├── plotly_umap.py                  # Interactive 2D UMAP scatter
│       ├── phenotype_heatmap.py            # Cluster × feature heatmap
│       └── readmission_by_cluster.py       # Readmission rate per phenotype
├── data/
│   ├── raw/                                # MIMIC-IV ICD-10 extracts
│   └── processed/                          # Embeddings, clusters, profiles
├── notebooks/
│   ├── 01_embedding_exploration.ipynb
│   └── 02_phenotype_analysis.ipynb
├── models/                                 # Saved UMAP/HDBSCAN models
├── docs/figures/                           # Output plots
├── requirements.txt
└── README.md

Pipeline Architecture

MIMIC-IV diagnoses_icd
        │
        ▼
[ICD-10 Code List + CMS Descriptions]
        │──── ICD-10-CM code → plain-text description mapping
        │──── ClinicalBERT tokenizer + model
        │──── 768-dim embedding per ICD-10 code (mean pooling)
        │──── Cache to disk (avoid re-encoding on rerun)
        ▼
[Patient-Level Aggregation]
        │──── Per-admission: collect all ICD-10 codes
        │──── Aggregate strategy: mean-pool code embeddings
        │     (weighted by diagnosis sequence position)
        │──── Output: 768-dim admission-level embedding
        ▼
[Dimensionality Reduction — UMAP]
        │──── 768 → 50 dims (UMAP for clustering)
        │──── 768 → 2 dims  (UMAP for visualization)
        │──── Fit on train split only; transform val/test
        ▼
[Clustering — HDBSCAN]
        │──── min_cluster_size tuned via DBCV score
        │──── Soft cluster probabilities
        │──── Noise point handling (label = -1)
        ▼
[Cluster Profiling]
        │──── Top ICD-10 codes per cluster
        │──── Mean Elixhauser score, age, LOS per cluster
        │──── Readmission rate per cluster
        │──── Silhouette + DBCV validation
        ▼
[Visualization]
        ──── Plotly interactive UMAP scatter (colored by cluster)
        ──── Readmission rate bar chart per phenotype
        ──── Cluster × comorbidity heatmap
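
The reduction and clustering stages above can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's code: it assumes the `umap-learn` and `hdbscan` packages, and the names `pick_best`, `reduce_and_cluster`, the candidate grid, and the seed are illustrative. `relative_validity_` is hdbscan's fast approximation of DBCV and stands in here for the tuning score.

```python
def pick_best(candidates, score_fn):
    """Return (candidate, score) with the highest score. Hypothetical helper."""
    scored = [(score_fn(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def reduce_and_cluster(X_train, X_test, candidates=(25, 50, 100), seed=42):
    """UMAP (fit on train only) followed by HDBSCAN tuned on a DBCV proxy."""
    # Imported lazily so the helper above works without these packages.
    import hdbscan
    import umap  # provided by the umap-learn package

    # Fit the reducer on the training split only, then project the test split.
    reducer = umap.UMAP(n_components=50, random_state=seed).fit(X_train)
    Z_train = reducer.embedding_
    Z_test = reducer.transform(X_test)

    def fit(min_cluster_size):
        return hdbscan.HDBSCAN(
            min_cluster_size=min_cluster_size,
            gen_min_span_tree=True,   # required for relative_validity_
            prediction_data=True,     # required for approximate_predict
        ).fit(Z_train)

    # Assumed tuning grid; pick the size with the best DBCV approximation.
    best_size, _ = pick_best(candidates, lambda m: fit(m).relative_validity_)
    clusterer = fit(best_size)

    # Noise points keep label -1; probabilities_ gives per-point membership strength.
    test_labels, test_probs = hdbscan.approximate_predict(clusterer, Z_test)
    return clusterer.labels_, clusterer.probabilities_, test_labels, test_probs
```

Refitting the winning configuration once more keeps the sketch short; a real run would more likely cache the fitted clusterers during the search.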

Embedding Strategy

Model

  • Primary: emilyalsentzer/Bio_ClinicalBERT (initialized from BioBERT, further pretrained on MIMIC-III clinical notes)
  • Fallback: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext

ICD-10 Code → Embedding

Each ICD-10 code is embedded from its plain-text description (CMS FY2024):

ICD-10: I50.32
Description: "Chronic diastolic (congestive) heart failure"
→ ClinicalBERT tokenize → mean-pool last hidden states → [768-dim vector]
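
The tokenize-and-pool step above might look like the sketch below. It assumes the Hugging Face `transformers` and `torch` packages. Whether the repository masks padding tokens or excludes special tokens when pooling is not stated, so the masking shown here is an assumption, and both function names are illustrative.

```python
import numpy as np


def masked_mean_pool(hidden, mask):
    """Average token vectors over real (non-padding) tokens only.

    hidden: (batch, seq_len, dim) last hidden states
    mask:   (batch, seq_len) attention mask of 0/1
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts


def embed_descriptions(texts, model_name="emilyalsentzer/Bio_ClinicalBERT"):
    """Encode ICD-10 descriptions into one vector each (768-dim for this model)."""
    # Imported lazily: the pooling helper above needs only NumPy.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return masked_mean_pool(hidden.numpy(), batch["attention_mask"].numpy())
```

Calling `embed_descriptions(["Chronic diastolic (congestive) heart failure"])` would return an array of shape `(1, 768)` once the model weights are available locally.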

Patient Embedding Aggregation

For each admission with diagnoses [d₁, d₂, ..., dₙ]:

patient_emb = Σᵢ wᵢ · emb(dᵢ) / Σᵢ wᵢ

where wᵢ = 1/√(seq_numᵢ), with seq_numᵢ the diagnosis sequence position of dᵢ (down-weights secondary diagnoses)
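
As a worked illustration of the weighting, the sketch below computes a normalized weighted mean with inverse-square-root weights. The function name and the toy 2-dim vectors are illustrative; real code embeddings are 768-dim.

```python
import numpy as np


def aggregate_admission(code_embs, seq_nums):
    """Weighted mean of code embeddings with weights 1/sqrt(seq_num)."""
    embs = np.asarray(code_embs, dtype=float)                    # (n_codes, dim)
    weights = 1.0 / np.sqrt(np.asarray(seq_nums, dtype=float))   # (n_codes,)
    return (weights[:, None] * embs).sum(axis=0) / weights.sum()


# Toy example: primary diagnosis (seq_num=1) and a fourth-listed diagnosis.
# Weights are 1.0 and 0.5, so the result is approximately [0.667, 0.333].
toy = aggregate_admission([[1.0, 0.0], [0.0, 1.0]], [1, 4])
```

With this weighting the primary diagnosis contributes twice as much as the fourth-listed one, and the decay flattens for long diagnosis lists.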

Cluster Validation

| Metric | Description | Target |
|---|---|---|
| DBCV score | Density-based cluster validity | > 0.3 |
| Silhouette (50D) | Within- vs. between-cluster separation | > 0.25 |
| Noise fraction | % of samples labeled noise by HDBSCAN | < 15% |
| Clinical coherence | Expert review of top-10 ICD codes per cluster | Qualitative |
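
The quantitative targets can be checked with a few lines of code. The sketch below is illustrative: the function names are hypothetical, scikit-learn is assumed for the silhouette score, and excluding noise points before computing the silhouette is an assumption rather than something the pipeline description specifies.

```python
import numpy as np


def noise_fraction(labels):
    """Share of samples HDBSCAN labeled as noise (label == -1)."""
    return float(np.mean(np.asarray(labels) == -1))


def silhouette_without_noise(Z, labels):
    """Silhouette on clustered points only (assumption: noise is excluded)."""
    from sklearn.metrics import silhouette_score  # lazy import

    labels = np.asarray(labels)
    keep = labels != -1
    return float(silhouette_score(np.asarray(Z)[keep], labels[keep]))


def meets_targets(dbcv, silhouette, noise):
    """Compare metrics against the stated targets."""
    return {
        "dbcv": dbcv > 0.3,
        "silhouette": silhouette > 0.25,
        "noise": noise < 0.15,
    }
```

Returning a per-metric dictionary rather than a single boolean makes it clear which target a given run misses.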

Expected Phenotype Clusters

Based on prior ICD embedding literature (Choi et al. 2016; Razavian et al. 2016):

| Cluster (expected) | Dominant ICD-10 Codes | Expected Readmit Rate |
|---|---|---|
| Cardiovascular-metabolic | I50.x, E11.x, N18.x | ~22–28% |
| Oncology/hematologic | C-codes, D-codes | ~18–25% |
| Respiratory | J44.x, J96.x | ~20–26% |
| Psychiatric/SUD | F-codes | ~15–20% |
| Musculoskeletal/orthopedic | M-codes, S-codes | ~8–12% |
| Gastrointestinal | K-codes | ~12–16% |
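
Once cluster labels exist, observed 30-day readmission rates can be compared against the expected ranges above. The sketch below shows one way to compute a per-cluster rate. The function name is hypothetical, skipping noise points is an assumption, and the inputs in the example are toy values, not results.

```python
from collections import defaultdict


def readmission_rate_by_cluster(labels, readmitted):
    """Map each cluster label to its observed readmission rate.

    labels:     cluster label per admission (-1 = HDBSCAN noise, skipped here)
    readmitted: truthy if the admission was followed by a 30-day readmission
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [readmissions, admissions]
    for label, flag in zip(labels, readmitted):
        if label == -1:
            continue  # assumption: noise points are not profiled
        counts[label][0] += int(bool(flag))
        counts[label][1] += 1
    return {label: r / n for label, (r, n) in sorted(counts.items())}


# Toy inputs only: two clusters plus one noise point.
rates = readmission_rate_by_cluster([0, 0, 1, 1, 1, -1], [1, 0, 0, 0, 1, 1])
```

Here cluster 0 has one readmission in two admissions and cluster 1 has one in three; the noise point is ignored.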

Installation and Usage

pip install -r requirements.txt

# Step 1: Generate per-code ClinicalBERT embeddings
python src/embeddings/icd10_description_embedder.py

# Step 2: Aggregate to patient-level embeddings
python src/embeddings/patient_embedding_aggregator.py

# Step 3: Dimensionality reduction
python src/clustering/dimensionality_reduction.py

# Step 4: HDBSCAN clustering
python src/clustering/hdbscan_clustering.py

# Step 5: Cluster profiling and visualization
python src/clustering/cluster_profiler.py
python src/visualization/plotly_umap.py

Citation

Lee, A. G. (2025). ICD-10 diagnosis code embedding and phenotype clustering using ClinicalBERT and MIMIC-IV. GitHub. https://github.com/SaeMind/icd10-phenotype


License

Code: MIT. MIMIC-IV data are governed by the PhysioNet Credentialed Health Data License 1.5.0.
