A lightweight, safety-oriented evaluation harness for testing how LLMs behave in controlled clinical decision-support scenarios.
This repository is an evaluation artifact. It is not a medical device, not a clinical product, and not for patient care.
Checked-in canonical published run, from results/run_manifest.json and results/summary.md:
| Field | Current checked-in value |
|---|---|
| Provider / model | openai / gpt-4o |
| Run ID | 20260305_045410 |
| Scored cases | 25 / 25 |
| PASS / WARN / FAIL | 22 / 3 / 0 |
| Unsafe recommendation rate | 0.0% |
| Hallucination suspicion rate | 0.0% |
| Refusal failure rate | 0.0% |
| Mean faithfulness proxy | 0.866 |
| Mean uncertainty alignment | 0.932 |
Guardrail: these are heuristic evaluator outputs for one explicit published run. They are not evidence of clinical safety or deployment readiness. The checked-in published artifacts reflect the current stricter evaluator rules, including non-empty section checks and rationale-scoped required citations.
Historical raw generations used for cache/reproducibility are stored separately under results/cache/ and are not the published benchmark result set.
For the fastest review path:
- First click: `results/summary.md` for the scorecard, safety-style rates, failure tags, and worst cases.
- Read `docs/REVIEWER_WORKFLOW.md` for artifact trust boundaries and review order.
- Check `docs/notable_failures.md` for representative WARN cases and scoring-boundary notes.
- Use `docs/safety_case.md` for hazard framing, mitigations, and non-claims.
- Use `docs/artifacts_guide.md` when inspecting individual result files.
This project simulates a pre-deployment healthcare AI evaluation workflow:
- run a fixed clinical evaluation dataset
- generate structured model responses from a fixed prompt template
- score outputs with safety- and faithfulness-oriented heuristics
- surface flagged cases for human review
- summarize benchmark artifacts for reviewer inspection
The goal is not to build a medical model. The goal is to build a credible evaluation sandbox that shows how a healthcare AI team might risk-test an LLM before workflow integration.
The benchmark uses structured clinical scenarios and checks whether a model:
- answers from the provided context instead of inventing facts
- cites the allowed context anchors
- expresses uncertainty or refuses when evidence is insufficient
- avoids forbidden or unsafe actions
- follows a response format that is easy to inspect and score
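As an illustration of the citation check in the list above, here is a minimal sketch of how cited anchors could be compared with an allowed set. The bracketed anchor format (`[C1]`), the function name, and the result keys are assumptions made for this example; the actual rules live in `src/metrics.py` and may differ.

```python
import re

# Hypothetical anchor format: bracketed IDs such as [C1].
# The real format used by the benchmark is not shown in this README.
ANCHOR_PATTERN = re.compile(r"\[(C\d+)\]")


def check_citations(answer: str, allowed_anchors: set[str]) -> dict:
    """Classify the anchors cited in an answer against the allowed set."""
    cited = set(ANCHOR_PATTERN.findall(answer))
    return {
        "cited": sorted(cited),
        "invalid": sorted(cited - allowed_anchors),
        "has_citation": bool(cited),
    }


# Made-up answer text, used only to exercise the function.
result = check_citations(
    "The context says the lab value is still pending [C2]. See also [C9].",
    allowed_anchors={"C1", "C2", "C3"},
)
print(result)
```

An answer citing an anchor outside the allowed set (here `C9`) would be a candidate for a citation-validity flag, and an answer with no anchors at all could be handled separately through `has_citation`.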
This repository does not claim:
- clinical validity
- clinician-grade adjudication
- regulatory readiness
- complete safety coverage
- real-world deployment approval
Faithfulness and safety checks are heuristic by design. Results should be interpreted as a controlled evaluation artifact, not as proof that a model is clinically safe.
Clinical AI evaluation cannot rely on accuracy alone. This sandbox focuses on signals that matter in healthcare settings:
- faithfulness to provided context
- citation validity
- uncertainty and refusal behavior
- unsafe recommendation detection
- reviewer-friendly failure analysis
The repo is intentionally small, auditable, and governance-oriented rather than production-complete.
clinical-AI-eval_sandbox/
├── dataset/
│ └── clinical_questions.csv # Fixed evaluation cases
├── src/
│ ├── prompt_templates.py # Prompt template used for all cases
│ ├── llm_clients.py # Provider adapters
│ ├── generate_answers.py # Runs model generation and caches outputs
│ ├── metrics.py # Scoring and safety flags
│ ├── run_evaluation.py # Applies metrics to generations
│ ├── summarize_results.py # Builds markdown summary
│ └── build_reviewer_report.py # Builds derived local reviewer package
├── results/
│ ├── raw_generations.jsonl # One explicit published provider/model/run
│ ├── run_manifest.json # Published run identity and provenance
│ ├── evaluation_output.csv # Scored case-level results
│ ├── flagged_cases.jsonl # WARN/FAIL subset for review
│ ├── summary.md # Human-readable run summary
│ └── cache/
│ └── raw_generations_cache.jsonl # Reusable raw-generation cache/history store
├── docs/
│ ├── architecture.md # System overview
│ ├── artifacts_guide.md # File-by-file artifact guide
│ ├── REVIEWER_WORKFLOW.md # Artifact trust map and review order
│ ├── reviewer_package.md # Derived reviewer package boundaries and usage
│ ├── reviewer_guide.md # Fast reviewer walkthrough
│ ├── results_interpretation.md # Benchmark interpretation guidance
│ ├── safety_case.md # Safety framing and hazards
│ ├── failure_modes.md # Failure taxonomy and known limitations
│ ├── notable_failures.md # Representative cases
│ ├── CODEX_RUNBOOK.md # Repo-local Codex workflow
│ └── maintenance_boundaries.md # Eval-sensitive change policy
├── AGENTS.md # Codex operating constraints
├── Makefile # Local verification helpers
├── requirements.txt
└── README.md
dataset/clinical_questions.csv
-> src/prompt_templates.py
-> src/generate_answers.py
-> results/raw_generations.jsonl + results/run_manifest.json
-> src/run_evaluation.py + src/metrics.py
-> results/evaluation_output.csv + results/flagged_cases.jsonl
-> src/summarize_results.py
-> results/summary.md
The main review artifacts are:
- `results/raw_generations.jsonl`: raw prompts, answers, and metadata for the one published run
- `results/run_manifest.json`: the explicit provider / model / run_id backing the public artifacts
- `results/evaluation_output.csv`: case-level metrics, flags, and PASS/WARN/FAIL grades
- `results/flagged_cases.jsonl`: subset for manual inspection of concerning outputs
- `results/summary.md`: compact benchmark report with rates, means, and worst cases
- `results/cache/raw_generations_cache.jsonl`: reusable raw-generation cache/history store that is not itself the public benchmark set
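For reviewers who want to cross-check the grade counts themselves, the sketch below tallies PASS/WARN/FAIL values from CSV rows. The column names `case_id` and `grade` are assumptions; check the header of `results/evaluation_output.csv` for the real ones. The sample rows are made up so the snippet runs without the repository.

```python
import csv
import io
from collections import Counter


def tally_grades(rows, grade_column="grade"):
    """Count PASS/WARN/FAIL values; grade_column is an assumed column name."""
    counts = Counter(row[grade_column] for row in rows)
    return {grade: counts.get(grade, 0) for grade in ("PASS", "WARN", "FAIL")}


# Tiny in-memory stand-in for results/evaluation_output.csv (made-up rows).
sample = io.StringIO("case_id,grade\nc01,PASS\nc02,WARN\nc03,PASS\n")
grade_counts = tally_grades(csv.DictReader(sample))
print(grade_counts)
```

To run it against the checked-in file, open `results/evaluation_output.csv` and pass `csv.DictReader(f)` in place of the in-memory sample, adjusting `grade_column` if needed.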
The reviewer package is a generated convenience view, not a canonical benchmark artifact. It is derived from completed-run artifacts without changing scoring, prompts, datasets, thresholds, tags, metrics definitions, or published artifact meaning.
To generate it locally:

    make reviewer-package

Equivalent direct command:

    python src/build_reviewer_report.py --results-dir results

Then open `reviewer_packages/<provider>_<model_id>_<run_id>/reviewer_report.html` in a browser. The package is ignored by git and also includes `reviewer_summary.json`, a machine-readable derived summary that mirrors the HTML sections. The generator validates run identity and flagged-case overlap before rendering.
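The two validations mentioned above (run identity and flagged-case overlap) could look roughly like the sketch below. This is not the code in `src/build_reviewer_report.py`; the field names `run_id` and `case_id` and the function name are hypothetical, and the rows are made up.

```python
def validate_reviewer_inputs(manifest, evaluation_rows, flagged_rows):
    """Return a list of problems; an empty list means the inputs look consistent."""
    problems = []
    run_id = manifest.get("run_id")  # assumed manifest key

    # Run identity: every row should belong to the run named in the manifest.
    for label, rows in (("evaluation", evaluation_rows), ("flagged", flagged_rows)):
        unexpected = {row.get("run_id") for row in rows} - {run_id}
        if unexpected:
            names = sorted(str(value) for value in unexpected)
            problems.append(f"{label} rows carry unexpected run_id values: {names}")

    # Overlap: every flagged case should also appear in the scored output.
    scored_ids = {row["case_id"] for row in evaluation_rows}
    orphaned = {row["case_id"] for row in flagged_rows} - scored_ids
    if orphaned:
        problems.append(f"flagged cases missing from evaluation output: {sorted(orphaned)}")
    return problems


manifest = {"run_id": "20260305_045410"}
scored = [
    {"run_id": "20260305_045410", "case_id": "c01"},
    {"run_id": "20260305_045410", "case_id": "c02"},
]
consistent = validate_reviewer_inputs(manifest, scored, [scored[1]])
inconsistent = validate_reviewer_inputs(
    manifest, scored, [{"run_id": "other_run", "case_id": "c99"}]
)
print(consistent, len(inconsistent))
```

Returning a list of problems instead of raising on the first one lets a generator report every inconsistency before refusing to render.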
This repo separates offline verification, exploratory sandbox runs, and published benchmark candidates.
The Offline Verification workflow compiles the repo, runs the unit tests, regenerates the published run from results/cache/raw_generations_cache.jsonl, and checks that the public artifacts reproduce exactly.
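One way to picture the "reproduce exactly" check is a byte-for-byte comparison of each published artifact against its regenerated counterpart. The sketch below uses SHA-256 digests for that; it is an assumption about how such a check could be written, not a description of the workflow's actual implementation, and the function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mismatched_artifacts(published: Path, regenerated: Path, names: list[str]) -> list[str]:
    """Return names that are missing on either side or differ byte-for-byte."""
    mismatched = []
    for name in names:
        left, right = published / name, regenerated / name
        if not (left.is_file() and right.is_file()):
            mismatched.append(name)
        elif sha256_of(left) != sha256_of(right):
            mismatched.append(name)
    return mismatched


# Self-contained demo using throwaway directories and made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    published_dir = Path(tmp, "published")
    regenerated_dir = Path(tmp, "regenerated")
    published_dir.mkdir()
    regenerated_dir.mkdir()
    for directory in (published_dir, regenerated_dir):
        (directory / "summary.md").write_text("identical content\n")
    (published_dir / "evaluation_output.csv").write_text("case_id,grade\nc01,PASS\n")
    (regenerated_dir / "evaluation_output.csv").write_text("case_id,grade\nc01,WARN\n")
    demo_mismatches = mismatched_artifacts(
        published_dir, regenerated_dir, ["summary.md", "evaluation_output.csv"]
    )

print(demo_mismatches)
```

An empty result would mean every listed artifact reproduced exactly; any name in the list points at a file worth diffing by hand.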
For a fast local health check before reviewing deeper:

    python -m unittest discover -s tests -v
    python -m py_compile src/*.py tests/*.py

The same checks are available through:

    make verify

Repo-local Codex guidance lives in `AGENTS.md`, with the operational runbook in `docs/CODEX_RUNBOOK.md`.
Future Codex sessions should classify changes as docs-only maintenance, derived tooling, sandbox support, benchmark revision, or result refresh before editing. Benchmark-defining files and checked-in results/ artifacts should only change when that is the explicit task.
The Clinical AI Eval (Sandbox Run) workflow is the API-backed path for exploratory runs.
Use it for:
- partial-dataset smoke tests
- prompt iteration checks
- provider comparisons
- `mock`-provider validation runs
Sandbox runs write to sandbox_results/ inside the workflow and upload artifacts for review. They do not overwrite results/.
The Clinical AI Eval (Published Benchmark Candidate) workflow is the guarded path for generating a full-dataset benchmark candidate for manual review.
It:
- forces the full dataset
- rejects the `mock` provider
- runs compile + unit-test checks first
- generates a live run, then rebuilds the candidate artifact set from cache
- verifies exact reproducibility before uploading the candidate artifacts
Published candidates are uploaded for manual review rather than pushed directly back to the repo.
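The first two guards in the list above (full dataset, no `mock` provider) can be expressed as a small pre-flight check. This is a sketch under stated assumptions: the `CandidateRequest` type, its fields, and the error messages are hypothetical, and the real workflow forces the full dataset rather than merely checking for it.

```python
from dataclasses import dataclass


@dataclass
class CandidateRequest:
    """Hypothetical description of a benchmark-candidate run request."""
    provider: str
    model: str
    run_id: str
    selected_cases: int
    total_cases: int


def candidate_errors(request: CandidateRequest) -> list[str]:
    """Return reasons the request does not qualify as a benchmark candidate."""
    errors = []
    if request.provider == "mock":
        errors.append("mock provider is not allowed for benchmark candidates")
    if request.selected_cases != request.total_cases:
        errors.append("benchmark candidates must cover the full dataset")
    return errors


accepted = candidate_errors(
    CandidateRequest("openai", "gpt-4o", "20260330_candidate", 25, 25)
)
rejected = candidate_errors(
    CandidateRequest("mock", "mock", "20260330_candidate", 5, 25)
)
print(accepted, rejected)
```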
Expected workflow inputs:
| Input | Example | Description |
|---|---|---|
| `model` | `gpt-4o` | Model used for generation |
| `prompt_version` | `v1` | Prompt label tracked in artifacts |
| `run_id` | `20260330_candidate` | Explicit benchmark-candidate run identifier |
If a reviewer wants to inspect the mechanics, the main scripts are:
- `src/generate_answers.py`
- `src/run_evaluation.py`
- `src/summarize_results.py`
- `src/build_reviewer_report.py`
src/generate_answers.py supports --run-kind sandbox, --run-kind candidate, and --run-kind published.
Use sandbox for exploratory or partial runs, candidate for full-dataset review artifacts, and published only for the checked-in canonical benchmark set and offline reproducibility.
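For readers unfamiliar with how a flag like `--run-kind` is typically restricted to a fixed set of values, here is a minimal `argparse` sketch. It only illustrates the three documented values; whether the real script makes the flag required, and what other arguments it accepts, is not stated here and is assumed for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser accepting only the three documented run kinds."""
    parser = argparse.ArgumentParser(description="Illustrative run-kind parsing")
    parser.add_argument(
        "--run-kind",
        choices=["sandbox", "candidate", "published"],
        required=True,  # assumption: the real script may define a default instead
        help="sandbox: exploratory; candidate: full-dataset review; published: canonical set",
    )
    return parser


args = build_parser().parse_args(["--run-kind", "candidate"])
print(args.run_kind)
```

With `choices` set, `argparse` exits with a usage error for any other value, so a typo in the run kind stops the run before any generation starts.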
Supported generation providers for src/generate_answers.py:
- `openai` using `OPENAI_API_KEY`
- `anthropic` using `ANTHROPIC_API_KEY`
- `gemini` using `GEMINI_API_KEY` or `GOOGLE_API_KEY`
- `mock` for deterministic pipeline validation without API access
The multi-provider support above exists at the generation-script layer. The checked-in GitHub Actions workflows currently wire openai for API-backed runs and mock for pipeline validation; using anthropic or gemini in CI would require extending workflow secrets and inputs.
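The provider-to-credential mapping above can be sketched as a small lookup. The environment variable names come from this README; the function name, the lookup order for `gemini` (first variable that is set wins), and the error behavior are assumptions, not a description of `src/llm_clients.py`.

```python
import os

# Variable names as listed in this README. The gemini lookup order is an assumption.
PROVIDER_KEY_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GEMINI_API_KEY", "GOOGLE_API_KEY"],
    "mock": [],  # deterministic provider, no credentials needed
}


def resolve_api_key(provider, env=None):
    """Return the API key for a provider, or None for the mock provider."""
    env = os.environ if env is None else env
    if provider not in PROVIDER_KEY_VARS:
        raise ValueError(f"unsupported provider: {provider}")
    candidates = PROVIDER_KEY_VARS[provider]
    if not candidates:
        return None
    for name in candidates:
        if env.get(name):
            return env[name]
    raise RuntimeError(f"set one of {', '.join(candidates)} to use {provider}")


# Demo with an explicit mapping instead of the real environment.
gemini_key = resolve_api_key("gemini", {"GOOGLE_API_KEY": "placeholder-key"})
mock_key = resolve_api_key("mock", {})
print(gemini_key, mock_key)
```

Passing `env` explicitly keeps the function testable without touching real credentials.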
- `docs/architecture.md`: architecture, modules, and data flow
- `docs/artifacts_guide.md`: what each results artifact contains and how to read it
- `docs/REVIEWER_WORKFLOW.md`: step-by-step artifact review order and source-of-truth guidance
- `docs/reviewer_package.md`: derived reviewer package usage, source dependencies, and boundaries
- `docs/results_interpretation.md`: how to interpret benchmark outputs and model comparisons responsibly
- `docs/safety_case.md`: safety framing, hazards, and mitigations
- `docs/failure_modes.md`: common failure categories plus known v1 limitations
- `docs/notable_failures.md`: representative flagged cases
- `docs/reviewer_guide.md`: quick walkthrough for interviewers and other reviewers
- `docs/CODEX_RUNBOOK.md`: repo-local operating workflow for future Codex sessions
- `docs/maintenance_boundaries.md`: what should not be edited casually because it can change benchmark meaning
The following files are benchmark-sensitive and should be treated as protected unless a benchmark revision is explicitly intended:
- `dataset/clinical_questions.csv`
- `src/prompt_templates.py`
- `src/metrics.py`
- `src/run_evaluation.py`
- `src/generate_answers.py`
- `results/run_manifest.json`
- `results/summary.md`
- `results/evaluation_output.csv`
- `results/flagged_cases.jsonl`
- `results/raw_generations.jsonl`
See docs/maintenance_boundaries.md for the maintenance policy used in this repo.
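A contributor could screen a change set against the protected list before committing. The sketch below is a hypothetical helper, not part of the repo's tooling; the path list is copied from this README, and the caller is assumed to supply changed paths relative to the repo root (for example from `git diff --name-only`).

```python
import posixpath

# Protected paths as listed in this README.
PROTECTED_PATHS = frozenset({
    "dataset/clinical_questions.csv",
    "src/prompt_templates.py",
    "src/metrics.py",
    "src/run_evaluation.py",
    "src/generate_answers.py",
    "results/run_manifest.json",
    "results/summary.md",
    "results/evaluation_output.csv",
    "results/flagged_cases.jsonl",
    "results/raw_generations.jsonl",
})


def protected_changes(changed_paths):
    """Return the changed paths that are on the protected list, sorted."""
    normalized = {posixpath.normpath(path) for path in changed_paths}
    return sorted(normalized & PROTECTED_PATHS)


touched = protected_changes(["README.md", "./src/metrics.py", "docs/safety_case.md"])
print(touched)
```

A non-empty result does not mean the change is wrong, only that it should be an explicit benchmark revision or result refresh rather than incidental maintenance.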
- The dataset is intentionally small and reviewable.
- Safety flags are heuristic and incomplete.
- Reported results are one explicit published provider / model / run, not universal model judgments.
- Human clinical review is outside the automated pipeline.
- Historical cached raw generations are kept separate from the published benchmark result set.
This project demonstrates:
- healthcare AI evaluation framing
- safety-aware benchmark design
- structured prompt and scoring discipline
- honest limitations and governance thinking
- reviewer-friendly artifact organization
This repository demonstrates evaluation methods for healthcare AI systems. It must not be used to provide medical advice, support patient care, or make clinical decisions.