NickLeko/clinical-AI-eval_sandbox


Clinical AI Evaluation Sandbox

A lightweight, safety-oriented evaluation harness for testing how LLMs behave in controlled clinical decision-support scenarios.

This repository is an evaluation artifact. It is not a medical device, not a clinical product, and not for patient care.

Published Run Snapshot

Checked-in canonical published run, from results/run_manifest.json and results/summary.md:

Field                          Current checked-in value
-----------------------------  ------------------------
Provider / model               openai / gpt-4o
Run ID                         20260305_045410
Scored cases                   25 / 25
PASS / WARN / FAIL             22 / 3 / 0
Unsafe recommendation rate     0.0%
Hallucination suspicion rate   0.0%
Refusal failure rate           0.0%
Mean faithfulness proxy        0.866
Mean uncertainty alignment     0.932
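
As a quick arithmetic check on the snapshot, the grade counts can be turned into shares of the scored cases. This is only a sketch: the counts are copied from the PASS / WARN / FAIL row above, and the helper name grade_rates is an illustrative assumption, not a function from this repository.

```python
# Grade counts copied from the PASS / WARN / FAIL row of the snapshot.
counts = {"PASS": 22, "WARN": 3, "FAIL": 0}


def grade_rates(counts):
    """Return each grade's share of all scored cases, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no scored cases")
    return {grade: round(100.0 * n / total, 1) for grade, n in counts.items()}


rates = grade_rates(counts)
print(rates)  # {'PASS': 88.0, 'WARN': 12.0, 'FAIL': 0.0}
```

With 25 scored cases, each case moves a rate by 4 percentage points, which is one reason the numbers should be read as a small-sample snapshot.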

Guardrail: these are heuristic evaluator outputs for one explicit published run. They are not evidence of clinical safety or deployment readiness. The checked-in published artifacts reflect the current stricter evaluator rules, including non-empty section checks and rationale-scoped required citations.

Historical raw generations used for cache/reproducibility are stored separately under results/cache/ and are not the published benchmark result set.

Start Here

For the fastest review path:

  1. Open results/summary.md first for the scorecard, safety-style rates, failure tags, and worst cases.
  2. Read docs/REVIEWER_WORKFLOW.md for artifact trust boundaries and review order.
  3. Check docs/notable_failures.md for representative WARN cases and scoring-boundary notes.
  4. Use docs/safety_case.md for hazard framing, mitigations, and non-claims.
  5. Use docs/artifacts_guide.md when inspecting individual result files.

What This Project Is

This project simulates a pre-deployment healthcare AI evaluation workflow:

  • run a fixed clinical evaluation dataset
  • generate structured model responses from a fixed prompt template
  • score outputs with safety- and faithfulness-oriented heuristics
  • surface flagged cases for human review
  • summarize benchmark artifacts for reviewer inspection

The goal is not to build a medical model. The goal is to build a credible evaluation sandbox that shows how a healthcare AI team might risk-test an LLM before workflow integration.

What It Evaluates

The benchmark uses structured clinical scenarios and checks whether a model:

  • answers from the provided context instead of inventing facts
  • cites the allowed context anchors
  • expresses uncertainty or refuses when evidence is insufficient
  • avoids forbidden or unsafe actions
  • follows a response format that is easy to inspect and score
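
To make the citation check concrete, here is a minimal sketch of an allowed-anchor heuristic. It assumes anchors are written as bracketed labels such as [C1]; that syntax, the function name check_citations, and the returned keys are all hypothetical and may differ from what src/metrics.py actually does.

```python
import re

# Assumed anchor syntax (hypothetical): bracketed labels such as [C1], [C2].
ANCHOR_PATTERN = re.compile(r"\[(C\d+)\]")


def check_citations(answer, allowed_anchors):
    """Report which anchors an answer cites and which are not in the allowed set."""
    cited = set(ANCHOR_PATTERN.findall(answer))
    allowed = set(allowed_anchors)
    return {
        "cited": sorted(cited),
        "invalid": sorted(cited - allowed),
        "has_citation": bool(cited),
    }


result = check_citations(
    "The context does not state the value [C2]; see also [C9].",
    allowed_anchors=["C1", "C2", "C3"],
)
print(result)  # {'cited': ['C2', 'C9'], 'invalid': ['C9'], 'has_citation': True}
```

A check like this only confirms that cited labels exist in the provided context. It says nothing about whether the cited passage supports the claim, which is why such signals are treated as heuristics.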

What It Does Not Claim

This repository does not claim:

  • clinical validity
  • clinician-grade adjudication
  • regulatory readiness
  • complete safety coverage
  • real-world deployment approval

Faithfulness and safety checks are heuristic by design. Results should be interpreted as a controlled evaluation artifact, not as proof that a model is clinically safe.

Why It Matters

Clinical AI evaluation cannot rely on accuracy alone. This sandbox focuses on signals that matter in healthcare settings:

  • faithfulness to provided context
  • citation validity
  • uncertainty and refusal behavior
  • unsafe recommendation detection
  • reviewer-friendly failure analysis

The repo is intentionally small, auditable, and governance-oriented rather than production-complete.

2-Minute Repo Map

clinical-AI-eval_sandbox/
├── dataset/
│   └── clinical_questions.csv      # Fixed evaluation cases
├── src/
│   ├── prompt_templates.py         # Prompt template used for all cases
│   ├── llm_clients.py              # Provider adapters
│   ├── generate_answers.py         # Runs model generation and caches outputs
│   ├── metrics.py                  # Scoring and safety flags
│   ├── run_evaluation.py           # Applies metrics to generations
│   ├── summarize_results.py        # Builds markdown summary
│   └── build_reviewer_report.py    # Builds derived local reviewer package
├── results/
│   ├── raw_generations.jsonl       # One explicit published provider/model/run
│   ├── run_manifest.json           # Published run identity and provenance
│   ├── evaluation_output.csv       # Scored case-level results
│   ├── flagged_cases.jsonl         # WARN/FAIL subset for review
│   ├── summary.md                  # Human-readable run summary
│   └── cache/
│       └── raw_generations_cache.jsonl   # Reusable raw-generation cache/history store
├── docs/
│   ├── architecture.md             # System overview
│   ├── artifacts_guide.md          # File-by-file artifact guide
│   ├── REVIEWER_WORKFLOW.md        # Artifact trust map and review order
│   ├── reviewer_package.md         # Derived reviewer package boundaries and usage
│   ├── reviewer_guide.md           # Fast reviewer walkthrough
│   ├── results_interpretation.md   # Benchmark interpretation guidance
│   ├── safety_case.md              # Safety framing and hazards
│   ├── failure_modes.md            # Failure taxonomy and known limitations
│   ├── notable_failures.md         # Representative cases
│   ├── CODEX_RUNBOOK.md            # Repo-local Codex workflow
│   └── maintenance_boundaries.md   # Eval-sensitive change policy
├── AGENTS.md                       # Codex operating constraints
├── Makefile                        # Local verification helpers
├── requirements.txt
└── README.md

Evaluation Pipeline

dataset/clinical_questions.csv
-> src/prompt_templates.py
-> src/generate_answers.py
-> results/raw_generations.jsonl + results/run_manifest.json
-> src/run_evaluation.py + src/metrics.py
-> results/evaluation_output.csv + results/flagged_cases.jsonl
-> src/summarize_results.py
-> results/summary.md

Core Outputs

The main review artifacts are:

  • results/raw_generations.jsonl: raw prompts, answers, and metadata for the one published run
  • results/run_manifest.json: the explicit provider / model / run_id backing the public artifacts
  • results/evaluation_output.csv: case-level metrics, flags, and PASS/WARN/FAIL grades
  • results/flagged_cases.jsonl: subset for manual inspection of concerning outputs
  • results/summary.md: compact benchmark report with rates, means, and worst cases
  • results/cache/raw_generations_cache.jsonl: reusable raw-generation cache/history store that is not itself the public benchmark set
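
The .jsonl artifacts hold one JSON object per line, so they can be inspected with a few lines of standard-library Python. The sketch below runs against a throwaway demo file rather than the real artifacts, and the field names case_id and grade are assumptions made for illustration; check docs/artifacts_guide.md for the actual schema.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse one JSON object per non-empty line; fail loudly on a bad line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return records


# Demo on a temporary file; "case_id" and "grade" are hypothetical field names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "flagged_cases_demo.jsonl"
    demo_path.write_text(
        '{"case_id": "demo-1", "grade": "WARN"}\n'
        "\n"
        '{"case_id": "demo-2", "grade": "WARN"}\n',
        encoding="utf-8",
    )
    records = read_jsonl(demo_path)

print([record["case_id"] for record in records])  # ['demo-1', 'demo-2']
```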

The reviewer package is a generated convenience view, not a canonical benchmark artifact. It is derived from completed-run artifacts without changing scoring, prompts, datasets, thresholds, tags, metrics definitions, or published artifact meaning.

To generate it locally:

make reviewer-package

Equivalent direct command:

python src/build_reviewer_report.py --results-dir results

Then open reviewer_packages/<provider>_<model_id>_<run_id>/reviewer_report.html in a browser. The package directory is ignored by git. It also includes reviewer_summary.json, a machine-readable derived summary that mirrors the HTML sections. The generator validates run identity and flagged-case overlap before rendering.
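
One plausible reading of "validates run identity and flagged-case overlap" is sketched below: every row should carry the manifest's run_id, and every flagged case should also appear in the evaluation output. The function name and the run_id / case_id field names are assumptions for illustration; the real checks live in src/build_reviewer_report.py and may differ.

```python
def validate_reviewer_inputs(manifest, evaluated_rows, flagged_rows):
    """Return a list of problems; an empty list means the inputs look consistent."""
    problems = []
    expected_run = manifest.get("run_id")

    # Run identity: every row must belong to the run named in the manifest.
    for label, rows in (("evaluation", evaluated_rows), ("flagged", flagged_rows)):
        for row in rows:
            if row.get("run_id") != expected_run:
                problems.append(
                    f"{label} case {row.get('case_id')} has run_id {row.get('run_id')!r}"
                )

    # Overlap: flagged cases must be a subset of the evaluated cases.
    evaluated_ids = {row.get("case_id") for row in evaluated_rows}
    for row in flagged_rows:
        if row.get("case_id") not in evaluated_ids:
            problems.append(
                f"flagged case {row.get('case_id')} is missing from evaluation output"
            )
    return problems


# Hypothetical demo data, not taken from the published artifacts.
manifest = {"run_id": "demo_run"}
evaluated = [
    {"case_id": "c1", "run_id": "demo_run"},
    {"case_id": "c2", "run_id": "demo_run"},
]
ok_problems = validate_reviewer_inputs(
    manifest, evaluated, [{"case_id": "c2", "run_id": "demo_run"}]
)
bad_problems = validate_reviewer_inputs(
    manifest, evaluated, [{"case_id": "c9", "run_id": "other_run"}]
)
print(ok_problems, len(bad_problems))  # [] 2
```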

Running The Project

This repo separates offline verification, exploratory sandbox runs, and published benchmark candidates.

Offline verification

The Offline Verification workflow compiles the repo, runs the unit tests, regenerates the published run from results/cache/raw_generations_cache.jsonl, and checks that the public artifacts reproduce exactly.
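
An exact-reproduction check of this kind can be approximated by comparing content hashes of the published files against a rebuilt copy. The sketch below is an assumption about how such a comparison could look, not the workflow's actual implementation; diff_artifacts and the demo directory names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def diff_artifacts(published_dir, rebuilt_dir, names):
    """Return the artifact names that are missing or differ byte-for-byte."""
    mismatched = []
    for name in names:
        published = Path(published_dir) / name
        rebuilt = Path(rebuilt_dir) / name
        if not published.is_file() or not rebuilt.is_file():
            mismatched.append(name)
        elif file_digest(published) != file_digest(rebuilt):
            mismatched.append(name)
    return mismatched


# Demo with throwaway directories instead of the real results/ folder.
with tempfile.TemporaryDirectory() as tmp:
    published, rebuilt = Path(tmp) / "published", Path(tmp) / "rebuilt"
    published.mkdir()
    rebuilt.mkdir()
    for folder in (published, rebuilt):
        (folder / "summary.md").write_text("same\n", encoding="utf-8")
    (published / "evaluation_output.csv").write_text("a,b\n1,2\n", encoding="utf-8")
    (rebuilt / "evaluation_output.csv").write_text("a,b\n1,3\n", encoding="utf-8")
    mismatched = diff_artifacts(
        published, rebuilt, ["summary.md", "evaluation_output.csv", "run_manifest.json"]
    )

print(mismatched)  # ['evaluation_output.csv', 'run_manifest.json']
```

Byte-level comparison is deliberately strict: a reordered row or a changed timestamp counts as a mismatch, which is the behavior an "exact" check implies.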

Quick local verification

For a fast local health check before reviewing deeper:

python -m unittest discover -s tests -v
python -m py_compile src/*.py tests/*.py

The same checks are available through:

make verify

Working with Codex

Repo-local Codex guidance lives in AGENTS.md, with the operational runbook in docs/CODEX_RUNBOOK.md.

Future Codex sessions should classify changes as docs-only maintenance, derived tooling, sandbox support, benchmark revision, or result refresh before editing. Benchmark-defining files and checked-in results/ artifacts should only change when that is the explicit task.

Sandbox runs

The Clinical AI Eval (Sandbox Run) workflow is the API-backed path for exploratory runs.

Use it for:

  • partial-dataset smoke tests
  • prompt iteration checks
  • provider comparisons
  • mock-provider validation runs

Sandbox runs write to sandbox_results/ inside the workflow and upload artifacts for review. They do not overwrite results/.

Published benchmark candidate

The Clinical AI Eval (Published Benchmark Candidate) workflow is the guarded path for generating a full-dataset benchmark candidate for manual review.

It:

  • forces the full dataset
  • rejects the mock provider
  • runs compile + unit-test checks first
  • generates a live run, then rebuilds the candidate artifact set from cache
  • verifies exact reproducibility before uploading the candidate artifacts

Published candidates are uploaded for manual review rather than pushed directly back to the repo.

Expected workflow inputs:

Input           Example             Description
--------------  ------------------  -------------------------------------------
model           gpt-4o              Model used for generation
prompt_version  v1                  Prompt label tracked in artifacts
run_id          20260330_candidate  Explicit benchmark-candidate run identifier
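
The guards described for this workflow (full dataset only, no mock provider, explicit inputs) can be pictured as a small pre-flight check. The following is a hedged sketch: the function name, the case_limit parameter, and the error messages are hypothetical and are not taken from the checked-in workflow.

```python
def check_candidate_inputs(provider, model, prompt_version, run_id, case_limit=None):
    """Return a list of reasons to reject a benchmark-candidate run (empty if OK)."""
    errors = []
    if provider == "mock":
        errors.append("mock provider is not allowed for benchmark candidates")
    if case_limit is not None:
        # Hypothetical knob: any partial-dataset limit disqualifies a candidate.
        errors.append("benchmark candidates must use the full dataset")
    for name, value in (
        ("model", model),
        ("prompt_version", prompt_version),
        ("run_id", run_id),
    ):
        if not value or not value.strip():
            errors.append(f"{name} must be a non-empty string")
    return errors


accepted = check_candidate_inputs("openai", "gpt-4o", "v1", "20260330_candidate")
rejected = check_candidate_inputs("mock", "gpt-4o", "v1", "", case_limit=5)
print(accepted)       # []
print(len(rejected))  # 3
```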

Local script entry points

If a reviewer wants to inspect the mechanics, the main scripts are:

  • src/generate_answers.py
  • src/run_evaluation.py
  • src/summarize_results.py
  • src/build_reviewer_report.py

src/generate_answers.py supports --run-kind sandbox, --run-kind candidate, and --run-kind published. Use sandbox for exploratory or partial runs, candidate for full-dataset review artifacts, and published only for the checked-in canonical benchmark set and offline reproducibility.
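
For readers unfamiliar with how such a flag is typically restricted to a fixed set of values, here is an illustrative argparse sketch. Only the --run-kind flag and its three values come from the text above; the parser itself, the help text, and the function name are assumptions and do not reproduce the real script's interface.

```python
import argparse

# The three run kinds named in the README.
RUN_KINDS = ("sandbox", "candidate", "published")


def build_parser():
    """Illustrative parser only; the real script defines its own arguments."""
    parser = argparse.ArgumentParser(description="Run-kind selection sketch")
    parser.add_argument(
        "--run-kind",
        choices=RUN_KINDS,
        required=True,
        help="sandbox: exploratory; candidate: full-dataset review; published: canonical set",
    )
    return parser


args = build_parser().parse_args(["--run-kind", "sandbox"])
print(args.run_kind)  # sandbox
```

With choices set, argparse rejects any other value and exits with a usage error, so a typo cannot silently fall through to a default run kind.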

Supported generation providers for src/generate_answers.py:

  • openai using OPENAI_API_KEY
  • anthropic using ANTHROPIC_API_KEY
  • gemini using GEMINI_API_KEY or GOOGLE_API_KEY
  • mock for deterministic pipeline validation without API access

The multi-provider support above exists at the generation-script layer. The checked-in GitHub Actions workflows currently wire openai for API-backed runs and mock for pipeline validation; using anthropic or gemini in CI would require extending workflow secrets and inputs.
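
The provider-to-credential mapping listed above can be expressed as a small lookup. This is a sketch under stated assumptions: the environment variable names come from the list, but the function name, the error handling, and the choice to prefer GEMINI_API_KEY over GOOGLE_API_KEY when both are set are hypothetical; src/llm_clients.py may resolve keys differently.

```python
import os

# Environment variable names as listed in the README; "mock" needs no key.
PROVIDER_ENV_VARS = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "gemini": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
    "mock": (),
}


def resolve_api_key(provider, env=None):
    """Return the first available key for a provider, or None for mock."""
    env = os.environ if env is None else env
    if provider not in PROVIDER_ENV_VARS:
        raise ValueError(f"unsupported provider: {provider}")
    names = PROVIDER_ENV_VARS[provider]
    if not names:
        return None  # deterministic mock provider, no credentials needed
    for name in names:  # assumed precedence: first listed name wins
        value = env.get(name)
        if value:
            return value
    raise RuntimeError(f"set one of {', '.join(names)} to use {provider}")


# Demo with a fake environment mapping rather than real secrets.
fake_env = {"GOOGLE_API_KEY": "demo-google-key"}
print(resolve_api_key("gemini", fake_env))  # demo-google-key
print(resolve_api_key("mock", fake_env))    # None
```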

Documentation Guide

  • docs/architecture.md: architecture, modules, and data flow
  • docs/artifacts_guide.md: what each results artifact contains and how to read it
  • docs/REVIEWER_WORKFLOW.md: step-by-step artifact review order and source-of-truth guidance
  • docs/reviewer_package.md: derived reviewer package usage, source dependencies, and boundaries
  • docs/results_interpretation.md: how to interpret benchmark outputs and model comparisons responsibly
  • docs/safety_case.md: safety framing, hazards, and mitigations
  • docs/failure_modes.md: common failure categories plus known v1 limitations
  • docs/notable_failures.md: representative flagged cases
  • docs/reviewer_guide.md: quick walkthrough for interviewers and other reviewers
  • docs/CODEX_RUNBOOK.md: repo-local operating workflow for future Codex sessions
  • docs/maintenance_boundaries.md: what should not be edited casually because it can change benchmark meaning

Eval-Sensitive Areas

The following files are benchmark-sensitive and should be treated as protected unless a benchmark revision is explicitly intended:

  • dataset/clinical_questions.csv
  • src/prompt_templates.py
  • src/metrics.py
  • src/run_evaluation.py
  • src/generate_answers.py
  • results/run_manifest.json
  • results/summary.md
  • results/evaluation_output.csv
  • results/flagged_cases.jsonl
  • results/raw_generations.jsonl

See docs/maintenance_boundaries.md for the maintenance policy used in this repo.
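
A lightweight way to apply this policy before committing is to compare the changed paths against the protected list. The sketch below copies the paths from the list above; the helper name protected_changes is hypothetical, and the repository does not necessarily ship such a check.

```python
from pathlib import PurePosixPath

# Paths copied from the benchmark-sensitive list above.
PROTECTED_PATHS = frozenset({
    "dataset/clinical_questions.csv",
    "src/prompt_templates.py",
    "src/metrics.py",
    "src/run_evaluation.py",
    "src/generate_answers.py",
    "results/run_manifest.json",
    "results/summary.md",
    "results/evaluation_output.csv",
    "results/flagged_cases.jsonl",
    "results/raw_generations.jsonl",
})


def protected_changes(changed_paths):
    """Return the changed paths that touch benchmark-sensitive files."""
    hits = []
    for raw in changed_paths:
        path = PurePosixPath(raw).as_posix()  # drops a leading "./"
        if path in PROTECTED_PATHS:
            hits.append(path)
    return sorted(hits)


touched = protected_changes(
    ["README.md", "./src/metrics.py", "docs/architecture.md", "results/summary.md"]
)
print(touched)  # ['results/summary.md', 'src/metrics.py']
```

The list of changed paths could come from something like the output of git diff --name-only; a non-empty result is a prompt to confirm that a benchmark revision is actually intended.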

Known Scope Boundaries

  • The dataset is intentionally small and reviewable.
  • Safety flags are heuristic and incomplete.
  • Reported results are one explicit published provider / model / run, not universal model judgments.
  • Human clinical review is outside the automated pipeline.
  • Historical cached raw generations are kept separate from the published benchmark result set.

Governance Signals

This repository demonstrates:

  • healthcare AI evaluation framing
  • safety-aware benchmark design
  • structured prompt and scoring discipline
  • honest limitations and governance thinking
  • reviewer-friendly artifact organization

Disclaimer

This repository demonstrates evaluation methods for healthcare AI systems. It must not be used to provide medical advice, support patient care, or make clinical decisions.
