CURA: Calibrated Uncertainty with Retrieval and Agents

CURA is a reliability-first multimodal medical decision-support framework. It couples knowledge-graph grounded retrieval with a heterogeneous clinical council, fuses opinions via Bayesian belief aggregation, and exposes a calibrated posterior to downstream decision policies — selective abstention, split conformal prediction sets, entropy-gated escalation, and uncertainty-driven interactive diagnosis.
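
To make the split conformal step concrete, here is a minimal sketch of how a prediction set could be derived from a posterior over answer options. It is an illustration under stated assumptions, not the code in aggregation/conformal.py: the nonconformity score (1 minus the probability of the true label), the function names, and the calibration values are all hypothetical.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank into the sorted scores
    if k > n:
        # Too few calibration points for this alpha: every label must be kept.
        return float("inf")
    return sorted(cal_scores)[k - 1]

def prediction_set(posterior, qhat):
    """Keep every label whose score 1 - p does not exceed the threshold."""
    return [label for label, p in posterior.items() if 1.0 - p <= qhat]

# Hypothetical calibration scores: 1 - p(true label) on held-out cases.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.45]
qhat = conformal_quantile(cal_scores, alpha=0.2)

# Hypothetical posterior for a new four-option question.
posterior = {"A": 0.62, "B": 0.25, "C": 0.09, "D": 0.04}
print(qhat, prediction_set(posterior, qhat))
```

Under exchangeability of calibration and test cases, sets built this way contain the true label with probability at least 1 - alpha on average. Larger sets signal cases where the posterior is too diffuse to support a single answer.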

Core modules (cura/):

kg/ — Co-evolutionary LLM↔KG grounding: entity extraction, embedding-based matching, Monte Carlo multi-hop path sampling, evidence reranking
expert_panel.py — Heterogeneous clinical council with specialist personas (sequential and adaptive modes)
aggregation/bayesian.py — Bayesian belief aggregation in log-space with normalized Shannon entropy
aggregation/conformal.py — Split conformal prediction sets with Clopper-Pearson CIs
aggregation/selective.py — Selective abstention (risk-coverage curves, AURC)
aggregation/calibration.py — ECE computation and calibration diagnostics
llm.py — Unified LLM interface (OpenAI, Azure, Anthropic, AWS Bedrock)
usage_tracker.py — Token and cost tracking across all agents
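
The aggregation step listed above can be sketched as follows. This is a minimal illustration that assumes a uniform prior, conditionally independent experts, and per-expert reliability weights; the function names and the example opinions are hypothetical, and the actual logic in aggregation/bayesian.py may differ.

```python
import math

def aggregate_log_space(expert_probs, weights=None, eps=1e-9):
    """Pool expert distributions by summing weighted log-probabilities."""
    labels = list(expert_probs[0].keys())
    weights = weights or [1.0] * len(expert_probs)
    log_post = {
        y: sum(w * math.log(max(p.get(y, 0.0), eps))  # clamp to avoid log(0)
               for w, p in zip(weights, expert_probs))
        for y in labels
    }
    # Subtract the maximum before exponentiating (log-sum-exp trick).
    m = max(log_post.values())
    z = sum(math.exp(v - m) for v in log_post.values())
    return {y: math.exp(v - m) / z for y, v in log_post.items()}

def normalized_entropy(posterior):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    k = len(posterior)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in posterior.values() if p > 0)
    return h / math.log(k)

# Hypothetical opinions from three council members over options A, B, C.
experts = [
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.2, "B": 0.5, "C": 0.3},
]
posterior = aggregate_log_space(experts)
print(posterior, normalized_entropy(posterior))
```

Working in log-space keeps the product of many small probabilities from underflowing, and dividing the entropy by log(K) makes uncertainty comparable across questions with different numbers of options.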

Setup

# Environment
cd cura_env && bash setup.sh
conda activate cura_e1

# Install
pip install -e .

# API keys
cp .env.example .env  # fill in your keys

Requires at least one LLM provider key (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.). For Azure deployments, prefix model names with azure-.

KG backend requires Neo4j. See scripts/setup_neo4j.py and config/kg_settings.json.

Usage

from cura.agent import A1

agent = A1(path='./data', llm='azure-gpt-4.1')
result = agent.go("Diagnose this patient presenting with...")

Benchmarks

# Medical QA (MedQA, MedMCQA, MMLU-Medical, QA4MRE)
python benchmark/run_benchmark.py --dataset medqa --model azure-gpt-4.1

# AgentClinic interactive diagnosis (MedQA, NEJM scenarios)
python benchmark/run_agentclinic.py --dataset NEJM_Ext --n_scenarios 100 \
    --mode baseline_single --max_turns 10 \
    --commit_pmax 0.85 --commit_entropy 0.3

# VQA-RAD multimodal
python benchmark/run_benchmark.py --dataset vqa_rad --model azure-gpt-4.1

Results are saved to experiments/results/ with per-case JSONL, metrics CSV, and run metadata JSON.
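
The --commit_pmax and --commit_entropy flags in the AgentClinic command suggest a threshold-based stopping rule. The sketch below shows one plausible reading, in which the agent commits to a diagnosis only when the top posterior probability is high enough and the normalized entropy is low enough. Requiring both conditions, as well as the function names and example posteriors, is an assumption; the rule in benchmark/run_agentclinic.py may combine the thresholds differently.

```python
import math

def normalized_entropy(posterior):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    k = len(posterior)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in posterior.values() if p > 0)
    return h / math.log(k)

def should_commit(posterior, commit_pmax=0.85, commit_entropy=0.3):
    """Assumed rule: commit only if confidence is high AND entropy is low."""
    pmax = max(posterior.values())
    return pmax >= commit_pmax and normalized_entropy(posterior) <= commit_entropy

def first_commit_turn(posteriors, max_turns=10, **thresholds):
    """Return the 1-indexed turn at which the rule fires, or None if it never does."""
    for turn, posterior in enumerate(posteriors[:max_turns], start=1):
        if should_commit(posterior, **thresholds):
            return turn
    return None

# Hypothetical posteriors after successive rounds of patient questioning.
trajectory = [
    {"A": 0.60, "B": 0.30, "C": 0.10},  # top probability too low
    {"A": 0.90, "B": 0.06, "C": 0.04},  # confident, but entropy still above 0.3
    {"A": 0.95, "B": 0.03, "C": 0.02},  # both thresholds satisfied
]
print(first_commit_turn(trajectory))
```

The second posterior shows why the two thresholds are not redundant: a top probability of 0.90 passes the confidence gate, yet the remaining mass is spread widely enough that the entropy gate still asks for another turn.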

Key Scripts

| Script | Purpose |
| --- | --- |
| `benchmark/run_benchmark.py` | Run QA benchmarks with optional expert panel |
| `benchmark/run_agentclinic.py` | Interactive diagnosis (immediate / fixed-turn / uncertainty-driven) |
| `scripts/setup_neo4j.py` | Initialize Neo4j KG backend |
| `scripts/build_benchmark_kg.py` | Build domain KGs from literature |
| `scripts/plot_reliability_diagram.py` | Generate calibration and reliability plots |
| `scripts/run_ablation_diversity.py` | Council size / diversity ablations |
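
For readers unfamiliar with the quantity behind the calibration and reliability plots, the following is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. The bin count, the binning convention, and the example data are assumptions; this is not the implementation in aggregation/calibration.py or the plotting script.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Hypothetical run: ten answers given at 0.9 confidence, half of them correct.
confidences = [0.9] * 10
correct = [True] * 5 + [False] * 5
print(expected_calibration_error(confidences, correct))
```

A reliability diagram plots the same per-bin accuracy against per-bin confidence, so a model with ECE near zero traces the diagonal.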

Tests

python tests/run_all_tests.py

License

Apache 2.0. See LICENSE.

Acknowledgements

Built on LangGraph and the Biomni framework.
