Author: Andrew Lee | UTHealth Houston SBMI
Stack: Python (ClinicalBERT → UMAP → HDBSCAN → Plotly)
Dataset: MIMIC-IV (PhysioNet credentialed access)
Status: Reproducible pipeline — v1.0
This project generates dense vector representations of ICD-10 diagnosis codes using a pretrained clinical language model (ClinicalBERT / Bio_ClinicalBERT) and applies unsupervised clustering to discover latent patient phenotypes from admission diagnosis patterns.
The pipeline operationalizes a clinically meaningful question: Do ICD-10 code embeddings, when reduced and clustered, recover clinically coherent disease phenotypes without supervision? The answer has implications for cohort discovery, disease progression modeling, and computational phenotyping at scale.
Can ClinicalBERT-derived embeddings of ICD-10 code descriptions, combined with MIMIC-IV admission diagnosis sequences, produce clinically coherent phenotype clusters — and do these clusters predict 30-day readmission differentially?
icd10-phenotype/
├── src/
│ ├── embeddings/
│ │ ├── icd10_description_embedder.py # ClinicalBERT → per-code embeddings
│ │ ├── patient_embedding_aggregator.py # Per-admission embedding aggregation
│ │ └── embedding_cache.py # Disk-based embedding cache
│ ├── clustering/
│ │ ├── dimensionality_reduction.py # UMAP (2D viz + 50D clustering)
│ │ ├── hdbscan_clustering.py # HDBSCAN phenotype discovery
│ │ └── cluster_profiler.py # Cluster characterization
│ └── visualization/
│ ├── plotly_umap.py # Interactive 2D UMAP scatter
│ ├── phenotype_heatmap.py # Cluster × feature heatmap
│ └── readmission_by_cluster.py # Readmission rate per phenotype
├── data/
│ ├── raw/ # MIMIC-IV ICD-10 extracts
│ └── processed/ # Embeddings, clusters, profiles
├── notebooks/
│ ├── 01_embedding_exploration.ipynb
│ └── 02_phenotype_analysis.ipynb
├── models/ # Saved UMAP/HDBSCAN models
├── docs/figures/ # Output plots
├── requirements.txt
└── README.md
MIMIC-IV diagnoses_icd
│
▼
[ICD-10 Code List + CMS Descriptions]
│──── ICD-10-CM code → plain-text description mapping
│──── ClinicalBERT tokenizer + model
│──── 768-dim embedding per ICD-10 code (mean pooling)
│──── Cache to disk (avoid re-encoding on rerun)
▼
[Patient-Level Aggregation]
│──── Per-admission: collect all ICD-10 codes
│──── Aggregate strategy: mean-pool code embeddings
│ (weighted by diagnosis sequence position)
│──── Output: 768-dim admission-level embedding
▼
[Dimensionality Reduction — UMAP]
│──── 768 → 50 dims (UMAP for clustering)
│──── 768 → 2 dims (UMAP for visualization)
│──── Fit on train split only; transform val/test
▼
[Clustering — HDBSCAN]
│──── min_cluster_size tuned via DBCV score
│──── Soft cluster probabilities
│──── Noise point handling (label = -1)
▼
[Cluster Profiling]
│──── Top ICD-10 codes per cluster
│──── Mean Elixhauser score, age, LOS per cluster
│──── Readmission rate per cluster
│──── Silhouette + DBCV validation
▼
[Visualization]
│──── Plotly interactive UMAP scatter (colored by cluster)
│──── Readmission rate bar chart per phenotype
│──── Cluster × comorbidity heatmap
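The "cache to disk" step in the pipeline above can be sketched as follows. This is a minimal illustration, not the contents of `embedding_cache.py`: the function name `get_or_compute`, the JSON file format, and the `embed_fn` callback are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def get_or_compute(
    codes: List[str],
    embed_fn: Callable[[str], List[float]],
    cache_path: Path,
) -> Dict[str, List[float]]:
    """Return embeddings for `codes`, calling `embed_fn` only for uncached codes."""
    cache: Dict[str, List[float]] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text())
    # dict.fromkeys de-duplicates while preserving order
    missing = [c for c in dict.fromkeys(codes) if c not in cache]
    for code in missing:
        cache[code] = embed_fn(code)
    if missing:
        cache_path.write_text(json.dumps(cache))
    return {c: cache[c] for c in codes}


# Demo with a fake embedder (hypothetical) that records each call.
calls: List[str] = []


def fake_embed(code: str) -> List[float]:
    calls.append(code)
    return [float(len(code)), 0.0]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "icd10_embeddings.json"
    first = get_or_compute(["I50.32", "E11.9"], fake_embed, path)
    second = get_or_compute(["I50.32", "J44.1"], fake_embed, path)

print(calls)  # I50.32 is embedded once, on the first call only
```

For 768-dim vectors over the full ICD-10-CM code set, a binary format such as NumPy `.npy` would likely be more compact than JSON; JSON is used here only to keep the sketch dependency-free.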
- Primary: emilyalsentzer/Bio_ClinicalBERT (trained on MIMIC clinical notes)
- Fallback: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
Each ICD-10 code is embedded from its plain-text description (CMS FY2024):
ICD-10: I50.32
Description: "Chronic diastolic (congestive) heart failure"
→ ClinicalBERT tokenize → mean-pool last hidden states → [768-dim vector]
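The mean-pooling step can be sketched in NumPy as below. It assumes padded positions are excluded through the attention mask and that special tokens are pooled along with the rest; the source does not say how either is handled. The toy array stands in for the model's last hidden states (hidden size 3 instead of 768), so no model download is needed.

```python
import numpy as np


def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    last_hidden_state: (batch, seq_len, hidden) token embeddings.
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                     # guard against all-padding rows
    return summed / counts


# Toy stand-in for model output: 2 descriptions, 4 token slots, hidden size 3.
hidden = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # first description: 3 real tokens
                 [1, 1, 0, 0]])  # second description: 2 real tokens
code_embeddings = mean_pool(hidden, mask)
print(code_embeddings.shape)  # (2, 3)
```

With Hugging Face `transformers`, the two inputs would typically come from the model output's `last_hidden_state` and the tokenizer's `attention_mask`, converted to arrays.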
For each admission with diagnoses [d₁, d₂, ..., dₙ]:
patient_emb = weighted_mean([emb(d₁), emb(d₂), ..., emb(dₙ)], weights=[w₁, w₂, ..., wₙ])
where wᵢ = 1/√(seq_numᵢ), which down-weights secondary diagnoses (higher seq_num)
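A minimal sketch of this aggregation, assuming the weighted mean is normalized by the sum of the weights (the usual definition; the formula above does not spell it out). The function name `aggregate_admission` is hypothetical.

```python
import numpy as np


def aggregate_admission(code_embs: np.ndarray, seq_nums: np.ndarray) -> np.ndarray:
    """Weighted mean of code embeddings with w_i = 1 / sqrt(seq_num_i).

    code_embs: (n_codes, dim) embeddings for one admission.
    seq_nums:  (n_codes,) diagnosis sequence positions, starting at 1.
    """
    weights = 1.0 / np.sqrt(seq_nums.astype(np.float64))
    return (weights[:, None] * code_embs).sum(axis=0) / weights.sum()


# Two toy 2-dim "code embeddings": a primary diagnosis (seq_num 1)
# and a fourth-listed diagnosis (seq_num 4), so weights are 1.0 and 0.5.
embs = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
seq = np.array([1, 4])
admission_emb = aggregate_admission(embs, seq)
print(admission_emb)  # the primary diagnosis contributes twice as much
```

Under this weighting, the diagnosis at position 4 carries half the weight of the primary diagnosis, and the one at position 9 carries a third.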
| Metric | Description | Target |
|---|---|---|
| DBCV score | Density-based cluster validity | > 0.3 |
| Silhouette (50D) | Within vs. between cluster separation | > 0.25 |
| Noise fraction | % samples labeled noise by HDBSCAN | < 15% |
| Clinical coherence | Expert review of top-10 ICD codes per cluster | Qualitative |
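The quantitative rows of the table can be checked mechanically once cluster labels and scores are available. The sketch below computes the noise fraction from HDBSCAN-style labels and compares all three metrics with the table's targets. The labels and score values are made up for illustration, and the helper names are hypothetical.

```python
from typing import Dict, Sequence


def noise_fraction(labels: Sequence[int]) -> float:
    """Fraction of samples left unassigned by HDBSCAN (label == -1)."""
    if len(labels) == 0:
        raise ValueError("labels must be non-empty")
    return sum(1 for lab in labels if lab == -1) / len(labels)


def meets_targets(dbcv: float, silhouette: float, noise: float) -> Dict[str, bool]:
    """Compare measured values against the targets in the table above."""
    return {
        "dbcv": dbcv > 0.3,
        "silhouette": silhouette > 0.25,
        "noise_fraction": noise < 0.15,
    }


labels = [0, 0, 1, -1, 1, 2, 2, 2, 0, 1]  # made-up cluster labels
frac = noise_fraction(labels)             # 1 noise point out of 10
report = meets_targets(dbcv=0.35, silhouette=0.20, noise=frac)
print(frac, report)
```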
Hypothesized phenotypes, based on prior ICD embedding literature (Choi et al. 2016; Razavian et al. 2016):
| Cluster (expected) | Dominant ICD-10 Codes | Expected Readmit Rate |
|---|---|---|
| Cardiovascular-metabolic | I50.x, E11.x, N18.x | ~22–28% |
| Oncology/hematologic | C-codes, D-codes | ~18–25% |
| Respiratory | J44.x, J96.x | ~20–26% |
| Psychiatric/SUD | F-codes | ~15–20% |
| Musculoskeletal/orthopedic | M-codes, S-codes | ~8–12% |
| Gastrointestinal | K-codes | ~12–16% |
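Once clusters are assigned, the observed readmission rate per phenotype can be compared with the expectations above. The sketch below shows one way to compute it from (cluster label, readmitted) pairs. Excluding noise points (label -1) from the rates is an assumption, and the rows are synthetic, not MIMIC-IV data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def readmission_rate_by_cluster(rows: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Return the 30-day readmission rate per cluster, skipping noise (label -1)."""
    totals: Dict[int, int] = defaultdict(int)
    readmits: Dict[int, int] = defaultdict(int)
    for cluster, readmitted in rows:
        if cluster == -1:
            continue  # assumption: noise points are not treated as a phenotype
        totals[cluster] += 1
        readmits[cluster] += int(readmitted)
    return {c: readmits[c] / totals[c] for c in sorted(totals)}


# Synthetic admissions: (cluster label, readmitted within 30 days)
rows = [
    (0, True), (0, False), (0, False), (0, False),
    (1, True), (1, True), (1, False), (1, False),
    (-1, True),
]
rates = readmission_rate_by_cluster(rows)
print(rates)  # {0: 0.25, 1: 0.5}
```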
pip install -r requirements.txt
# Step 1: Generate per-code ClinicalBERT embeddings
python src/embeddings/icd10_description_embedder.py
# Step 2: Aggregate to patient-level embeddings
python src/embeddings/patient_embedding_aggregator.py
# Step 3: Dimensionality reduction
python src/clustering/dimensionality_reduction.py
# Step 4: HDBSCAN clustering
python src/clustering/hdbscan_clustering.py
# Step 5: Cluster profiling and visualization
python src/clustering/cluster_profiler.py
python src/visualization/plotly_umap.py
Lee, A. G. (2025). ICD-10 diagnosis code embedding and phenotype clustering using ClinicalBERT and MIMIC-IV. GitHub. https://github.com/SaeMind/icd10-phenotype
Code: MIT License. MIMIC-IV data is governed by the PhysioNet Credentialed Health Data License 1.5.0.