CURA is a reliability-first multimodal medical decision-support framework. It couples knowledge-graph grounded retrieval with a heterogeneous clinical council, fuses opinions via Bayesian belief aggregation, and exposes a calibrated posterior to downstream decision policies — selective abstention, split conformal prediction sets, entropy-gated escalation, and uncertainty-driven interactive diagnosis.
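To make one of these downstream policies concrete, here is a minimal sketch of split conformal prediction over class posteriors. It is an illustration under stated assumptions, not the code in CURA: the function names `conformal_threshold` and `prediction_set`, the nonconformity score (one minus the posterior on the true label), and the input layout (one row of class probabilities per case) are all hypothetical.

```python
import math

import numpy as np


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out cases (split conformal).

    Hypothetical helper: cal_probs is (n_cases, n_classes), cal_labels is (n_cases,).
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # Nonconformity score: 1 - posterior mass assigned to the true label.
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration cases for this alpha: every label must be kept.
        return float("inf")
    return float(scores[k - 1])


def prediction_set(probs, qhat):
    """Return the indices of all labels whose score 1 - p is within the threshold."""
    probs = np.asarray(probs, dtype=float)
    return [i for i, p in enumerate(probs) if 1.0 - p <= qhat]
```

When the calibration cases are exchangeable with the test cases, sets built this way contain the true label with probability at least `1 - alpha`; ambiguous cases simply produce larger sets.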
Core modules (cura/):

- `kg/`: Co-evolutionary LLM↔KG grounding (entity extraction, embedding-based matching, Monte Carlo multi-hop path sampling, evidence reranking)
- `expert_panel.py`: Heterogeneous clinical council with specialist personas (sequential and adaptive modes)
- `aggregation/bayesian.py`: Bayesian belief aggregation in log-space with normalized Shannon entropy
- `aggregation/conformal.py`: Split conformal prediction sets with Clopper-Pearson CIs
- `aggregation/selective.py`: Selective abstention (risk-coverage curves, AURC)
- `aggregation/calibration.py`: ECE computation and calibration diagnostics
- `llm.py`: Unified LLM interface (OpenAI, Azure, Anthropic, AWS Bedrock)
- `usage_tracker.py`: Token and cost tracking across all agents
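As a rough illustration of what log-space belief aggregation with normalized Shannon entropy can look like, the sketch below fuses per-expert distributions as a prior times a product of opinions. This is an assumption for illustration only: the function names, the product-of-opinions rule, and the uniform default prior are hypothetical and may differ from what `aggregation/bayesian.py` actually does.

```python
import numpy as np


def aggregate_log_space(expert_probs, prior=None, eps=1e-12):
    """Fuse expert opinions as prior * product of opinions, computed in log-space.

    Hypothetical helper: expert_probs is (n_experts, n_hypotheses).
    """
    probs = np.clip(np.asarray(expert_probs, dtype=float), eps, 1.0)
    n_hyp = probs.shape[1]
    if prior is None:
        prior = np.full(n_hyp, 1.0 / n_hyp)  # assumed uniform prior
    log_post = np.log(np.asarray(prior, dtype=float)) + np.log(probs).sum(axis=0)
    # Subtract the maximum before exponentiating to avoid underflow.
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()


def normalized_entropy(p, eps=1e-12):
    """Shannon entropy divided by log(K), so the value lies in [0, 1]."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    if p.size < 2:
        return 0.0
    return float(-(p * np.log(p)).sum() / np.log(p.size))


posterior = aggregate_log_space([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]])
uncertainty = normalized_entropy(posterior)
```

Summing logs instead of multiplying probabilities keeps the computation stable as the number of experts grows, and dividing the entropy by `log(K)` makes uncertainty comparable across questions with different numbers of candidate answers.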
```bash
# Environment
cd cura_env && bash setup.sh
conda activate cura_e1

# Install
pip install -e .

# API keys
cp .env.example .env  # fill in your keys
```

Requires at least one LLM provider key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.). For Azure deployments, prefix model names with `azure-`.

KG backend requires Neo4j. See `scripts/setup_neo4j.py` and `config/kg_settings.json`.
```python
from cura.agent import A1

agent = A1(path='./data', llm='azure-gpt-4.1')
result = agent.go("Diagnose this patient presenting with...")
```

```bash
# Medical QA (MedQA, MedMCQA, MMLU-Medical, QA4MRE)
python benchmark/run_benchmark.py --dataset medqa --model azure-gpt-4.1

# AgentClinic interactive diagnosis (MedQA, NEJM scenarios)
python benchmark/run_agentclinic.py --dataset NEJM_Ext --n_scenarios 100 \
    --mode baseline_single --max_turns 10 \
    --commit_pmax 0.85 --commit_entropy 0.3

# VQA-RAD multimodal
python benchmark/run_benchmark.py --dataset vqa_rad --model azure-gpt-4.1
```

Results are saved to `experiments/results/` with per-case JSONL, metrics CSV, and run metadata JSON.
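The `--commit_pmax`, `--commit_entropy`, and `--max_turns` flags in the AgentClinic command suggest a threshold-based stopping rule for interactive diagnosis. The sketch below shows one plausible reading and is an assumption, not confirmed behavior: it commits when the top posterior probability is at least `commit_pmax` and the normalized entropy is at most `commit_entropy`, or when the turn budget runs out. Whether CURA combines the two thresholds with AND or OR is not stated here, and the function names are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of probs divided by log(K), in [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)


def should_commit(probs, turn, commit_pmax=0.85, commit_entropy=0.3, max_turns=10):
    """Hypothetical commit rule: stop asking questions once the posterior is sharp."""
    if turn >= max_turns:
        return True  # turn budget exhausted, a final answer is forced
    confident = max(probs) >= commit_pmax
    concentrated = normalized_entropy(probs) <= commit_entropy
    return confident and concentrated  # assumed AND combination


print(should_commit([0.95, 0.03, 0.02], turn=3))  # sharp posterior
print(should_commit([0.50, 0.30, 0.20], turn=3))  # still uncertain, keep asking
```

Under this reading the two thresholds are not redundant: with three candidates, a posterior of `[0.90, 0.05, 0.05]` passes the `0.85` confidence test but has a normalized entropy of about 0.36, so it would not yet trigger a commit at `0.3`.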
| Script | Purpose |
|---|---|
| `benchmark/run_benchmark.py` | Run QA benchmarks with optional expert panel |
| `benchmark/run_agentclinic.py` | Interactive diagnosis (immediate / fixed-turn / uncertainty-driven) |
| `scripts/setup_neo4j.py` | Initialize Neo4j KG backend |
| `scripts/build_benchmark_kg.py` | Build domain KGs from literature |
| `scripts/plot_reliability_diagram.py` | Generate calibration and reliability plots |
| `scripts/run_ablation_diversity.py` | Council size / diversity ablations |
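For readers unfamiliar with the quantity behind the calibration and reliability plots, expected calibration error (ECE) can be sketched as below. This is a generic equal-width-bin formulation written for illustration; the function name, the bin count, and the binning convention are assumptions and are not taken from `aggregation/calibration.py` or `scripts/plot_reliability_diagram.py`.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (hypothetical helper).

    confidences: top-1 predicted probabilities in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins (lo, hi]; a confidence of exactly 0.0 falls in the first bin.
    idx = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # Weight each bin's |accuracy - mean confidence| gap by its share of samples.
            ece += mask.mean() * abs(hit[mask].mean() - conf[mask].mean())
    return float(ece)


ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

A reliability diagram plots the same per-bin accuracy against per-bin mean confidence; ECE is the sample-weighted average distance of those points from the diagonal.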
```bash
python tests/run_all_tests.py
```

Apache 2.0. See LICENSE.