An evaluation-first AI engineering case study for evolving a retry/backoff policy with OpenEvolve, local LLMs, strict QA-style gates, and holdout validation.
Positioning: Building evaluation-first AI systems for QA automation, reliability testing, and complex workflow validation.
Core thesis: The evaluator is the product. LLMs can propose code changes, but evaluator design, safety gates, holdout validation, and QA discipline determine whether the generated behavior is actually useful.
This repository is not just a retry-policy toy example. It is a small, reproducible experiment in AI-assisted software behavior improvement:
- define a narrow mutation surface
- generate candidate policies with LLMs
- evaluate behavior deterministically
- reject unsafe or overfitted candidates
- compare train vs holdout performance
- document model failure modes and evaluator weaknesses
The system under test is a single function, choose_action(), that decides
how a client should react to a transient or permanent failure when calling an
unstable HTTP-style endpoint. Possible decisions are retry,
switch_endpoint, open_circuit, or fail.
OpenEvolve mutates only the code inside an EVOLVE-BLOCK in
initial_program.py. Everything else — the contract,
helper code, scenarios, and evaluator — is stable and is treated as the
test harness against which candidates are judged.
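To make the mutation surface concrete, here is a minimal sketch of what a policy inside the block might look like. The parameter names (`attempt`, `status_code`, `idempotent`, `max_attempts`), the backoff constants, and the exact marker comments are assumptions for illustration; the real contract is whatever initial_program.py defines. Only the four action names and the `{"action": str, "wait_ms": int}` return shape come from this README.

```python
from typing import Dict, Union


def choose_action(
    attempt: int,
    status_code: int,
    idempotent: bool,
    max_attempts: int = 3,
) -> Dict[str, Union[str, int]]:
    """Hypothetical signature; the real one lives in initial_program.py."""
    # EVOLVE-BLOCK-START
    # Never retry a non-idempotent call: a duplicated side effect is worse
    # than a reported failure.
    if not idempotent:
        return {"action": "fail", "wait_ms": 0}
    # Assumption of this sketch: 4xx responses other than 429 are permanent.
    if 400 <= status_code < 500 and status_code != 429:
        return {"action": "fail", "wait_ms": 0}
    # Stop hammering the endpoint once the retry budget is spent.
    if attempt >= max_attempts:
        return {"action": "open_circuit", "wait_ms": 0}
    # Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at 2000 ms.
    wait_ms = min(100 * 2 ** attempt, 2000)
    return {"action": "retry", "wait_ms": wait_ms}
    # EVOLVE-BLOCK-END
```

Everything outside the two marker comments, including the signature line, would stay fixed across candidates.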
Retry/backoff is a deceptively small problem with big production consequences:
- flaky upstream APIs and gateways
- transient overload, rate limits, and slow responses
- non-idempotent operations where retrying is unsafe
- circuit-breaker style fail-fast behavior
- failover between endpoints
It is also a good proxy task for AI-assisted reliability work: the search space is narrow enough for small local models, but the evaluator can encode real QA-style invariants (safety, latency, success rate, generalization).
What this repository provides:
- a deterministic evaluator with train/holdout splits
- a narrow mutation surface (one function inside one block)
- hard safety gates (e.g. dangerous non-idempotent retries are heavily penalized)
- train vs holdout comparison to detect reward hacking and overfitting
- model failure-mode catalog observed across local models
- a structured acceptance checklist for candidate review
- a report template for repeatable post-run analysis
initial_program.py -> choose_action() inside an EVOLVE-BLOCK
evaluator.py -> loads program dynamically, runs scenarios, returns metrics
scenarios.py -> deterministic train (seed=42) / holdout (seed=314) sets
config.yaml -> OpenEvolve run configuration (local-friendly defaults)
tests/ -> pytest contract checks
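The seeded train/holdout split in the mapping above could be produced along the following lines. This is a sketch, not the contents of scenarios.py: the `Scenario` fields, the status-code pool, and the default `count` are hypothetical. Only the two seed values are taken from this README.

```python
import random
from dataclasses import dataclass
from typing import List

TRAIN_SEED = 42      # seed values as documented in this README
HOLDOUT_SEED = 314


@dataclass(frozen=True)
class Scenario:
    # Hypothetical fields; the real scenario schema may differ.
    status_code: int
    idempotent: bool
    failures_before_success: int


def make_scenarios(seed: int, count: int = 8) -> List[Scenario]:
    # A private Random instance keeps generation independent of global
    # random state, so the same seed always yields the same scenarios.
    rng = random.Random(seed)
    return [
        Scenario(
            status_code=rng.choice([429, 500, 502, 503, 504]),
            idempotent=rng.random() < 0.75,
            failures_before_success=rng.randint(0, 4),
        )
        for _ in range(count)
    ]


train = make_scenarios(TRAIN_SEED)
holdout = make_scenarios(HOLDOUT_SEED)
```

Using a dedicated generator per dataset, rather than calling `random.seed()` globally, means evaluating train never shifts what holdout contains.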
evaluator.py exposes evaluate(program_path) -> Dict[str, float] for
OpenEvolve and a CLI (--program, --dataset, --json). On any exception
the evaluator returns combined_score = -1e9 and runs_successfully = 0.0,
so a broken candidate is reported as an explicit failure rather than crashing silently.
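The failure-handling behavior described above can be sketched as follows. The `evaluate(program_path)` signature and the two sentinel values come from this README; `load_choose_action` and `run_scenarios` are hypothetical helpers, and `run_scenarios` is only a stand-in for the real scenario loop.

```python
import importlib.util
from typing import Callable, Dict

FAILURE_SCORE = -1e9  # sentinel documented in this README


def load_choose_action(program_path: str) -> Callable:
    """Import a candidate file by path and return its choose_action."""
    spec = importlib.util.spec_from_file_location("candidate_program", program_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {program_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # raises on syntax errors or missing files
    return getattr(module, "choose_action")  # raises if the function is gone


def run_scenarios(policy: Callable) -> Dict[str, float]:
    # Stand-in for the real scenario loop; it only checks callability.
    if not callable(policy):
        raise TypeError("choose_action must be callable")
    return {"combined_score": 0.0}


def evaluate(program_path: str) -> Dict[str, float]:
    try:
        metrics = run_scenarios(load_choose_action(program_path))
        metrics["runs_successfully"] = 1.0
        return metrics
    except Exception:
        # Any import, contract, or runtime error becomes an explicit,
        # very low score instead of an unhandled crash.
        return {"combined_score": FAILURE_SCORE, "runs_successfully": 0.0}
```

Returning a sentinel keeps the search loop running while making sure a broken candidate can never outrank a working one.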
.
├── README.md # This file
├── AGENTS.md # Project guidelines for coding agents
├── EXPERIMENTS.md # Experiment journal
├── LESSONS_LEARNED.md # Distilled insights
├── Taskfile.yml # Reproducible task runner entry points
├── config.yaml # OpenEvolve config (local LLMs)
├── evaluator.py # Deterministic evaluator + CLI
├── initial_program.py # choose_action() with EVOLVE-BLOCK
├── scenarios.py # Train/holdout scenario generators
├── requirements.txt
├── docs/
│ ├── POSITIONING.md # How this repo maps to QA / reliability work
│ ├── MODEL_FAILURE_MODES.md # Observed LLM failure modes
│ ├── ACCEPTANCE_CRITERIA.md # Hard / behavioral / holdout gates
│ ├── ARTICLE_SERIES_PLAN.md # Outline of upcoming write-ups
│ ├── REPORT_TEMPLATE.md # Per-run report template
│ ├── REPO_GUIDE.md # Where to look for what
│ ├── DOD.md # Definition of Done
│ ├── prompts/review_and_harden.md # Review/hardening prompt
│ └── reports/
│ ├── FINAL_REPORT.md # Polished case-study summary
│ ├── EXPERIMENT_LOG.md # Chronological observations
│ └── HOLDOUT_ANALYSIS.md # Train vs holdout deep dive
├── experiments/
│ ├── README.md
│ └── rounds/ # Saved OpenEvolve artifacts (best / checkpoints / logs)
│ ├── round_1_cloud_openevolve_output/
│ ├── round_2_local_openevolve_output/
│ ├── round_3_local_openevolve_output/
│ ├── round_4_local_openevolve_output/
│ ├── round_5_local_openevolve_output/
│ └── round_7_local_openevolve_output/
├── tests/ # pytest contract suite
└── scripts/ # Helper scripts (report generation, post-edit hooks)
Set up the environment:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Or via Task:

task install

Run the tests:

task test

Direct invocation:

pytest -q

Run the evaluator:

task eval:train
task eval:holdout
# both at once
task eval

Run OpenEvolve:

task evolve
# or with custom iteration count
ITERATIONS=20 task evolve
# or with debug logging
task evolve:debug

OpenEvolve must be installed (it is listed in requirements.txt). The run
configuration is in config.yaml. Local model endpoints
(Ollama, vLLM, LM Studio, OptiLLM) are supported via OpenAI-compatible APIs.
The Taskfile supports overriding the evaluated program via the CANDIDATE
variable. The default is initial_program.py:
# default: evaluate baseline
task eval:holdout
# evaluate an evolved candidate (path is illustrative)
CANDIDATE=experiments/rounds/round_1_cloud_openevolve_output/best/best_program.py task eval:holdout
# evaluate a checkpoint
CANDIDATE=experiments/rounds/round_2_local_openevolve_output/checkpoints/checkpoint_5/best_program.py task eval:train

Save train + holdout JSON for a candidate in one shot:

CANDIDATE=path/to/candidate.py task report
# writes reports/train.json and reports/holdout.json

Baseline results on train (seed=42):
{
"avg_latency_ms": 1320.0,
"avg_retry_count": 2.03125,
"combined_score": -8.678125,
"dangerous_non_idempotent_retries": 0.0,
"good_endpoint_switches": 0.0,
"good_fail_fast_decisions": 0.28125,
"runs_successfully": 1.0,
"success_rate": 0.40625,
"useless_retries": 0.6875
}

Baseline results on holdout (seed=314):
{
"avg_latency_ms": 2651.25,
"avg_retry_count": 1.8125,
"combined_score": -29.69375,
"dangerous_non_idempotent_retries": 0.0,
"good_endpoint_switches": 0.0,
"good_fail_fast_decisions": 0.3125,
"runs_successfully": 1.0,
"success_rate": 0.5625,
"useless_retries": 0.4375
}

The train set is used for evolution / search. The holdout set is used only for generalization checks and is not visible to the search loop.
Multiple OpenEvolve rounds were performed against both hosted and local models. See EXPERIMENTS.md for the full journal and docs/reports/FINAL_REPORT.md for the polished write-up.
A representative finding: a qwen2.5-coder:1.5b candidate improved train
combined_score from -8.6781 to +1.1094 while reducing
success_rate from 0.4062 to 0.3438. On holdout, candidates from the
same family showed regressions (e.g. success_rate dropping from 0.5625
to 0.375). This is the canonical example of why a scalar score must not be
the sole acceptance signal.
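The divergence described above is easy to detect mechanically. The sketch below flags reliability metrics that got worse while combined_score got better, using the train numbers quoted in this section. The function name and the choice of which metrics count as "higher is better" are assumptions.

```python
from typing import Dict, List, Sequence


def masked_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    higher_is_better: Sequence[str] = ("success_rate",),
) -> List[str]:
    """Return metrics that regressed even though combined_score improved."""
    if candidate["combined_score"] <= baseline["combined_score"]:
        return []  # no score gain, so nothing is being masked
    return [
        name
        for name in higher_is_better
        if candidate[name] < baseline[name]
    ]


# Train numbers quoted above for the qwen2.5-coder:1.5b candidate.
baseline_train = {"combined_score": -8.6781, "success_rate": 0.4062}
candidate_train = {"combined_score": 1.1094, "success_rate": 0.3438}

print(masked_regressions(baseline_train, candidate_train))  # ['success_rate']
```

A non-empty result is a prompt for human review, not an automatic rejection: some trade-offs may be intentional.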
Key findings:
- valid LLM-generated mutations are possible, including from small local models
- train score can be misleading; latency-driven score gains can mask reliability regressions
- holdout validation is mandatory, not optional
- local models exhibit distinct failure modes (no-diff, prose, contract drift, invalid keys)
- small local models can still be useful when the mutation surface is narrow
- scalar combined score is necessary but not sufficient for acceptance
- the evaluator is the main engineering artifact, not the policy itself
- the weakest link is evaluator/simulator realism
A candidate must pass all of the following before being considered accepted. Full checklist in ACCEPTANCE_CRITERIA.md.
- program imports successfully
- choose_action signature preserved
- return shape preserved ({"action": str, "wait_ms": int})
- runs_successfully == 1.0
- dangerous_non_idempotent_retries == 0.0
- no undefined variables, no external state, no network calls
- holdout success_rate not materially worse than baseline
- diff reviewed by a human
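Several of the gates above can be automated. The following sketch checks the return shape of a single decision and the metric-based gates for a holdout run. The helper names, the non-negative `wait_ms` rule, and the `max_success_drop` tolerance of 0.05 are assumptions; the README only says holdout success_rate must not be "materially worse", and the authoritative rules are in ACCEPTANCE_CRITERIA.md.

```python
from typing import Any, Dict, List

VALID_ACTIONS = {"retry", "switch_endpoint", "open_circuit", "fail"}


def payload_errors(decision: Any) -> List[str]:
    """Check one choose_action() result against the documented return shape."""
    if not isinstance(decision, dict):
        return ["decision is not a dict"]
    errors = []
    if set(decision) != {"action", "wait_ms"}:
        errors.append(f"unexpected keys: {sorted(map(str, decision))}")
    action = decision.get("action")
    if not isinstance(action, str) or action not in VALID_ACTIONS:
        errors.append(f"invalid action: {action!r}")
    wait = decision.get("wait_ms")
    # bool is a subclass of int, so it is rejected explicitly.
    # Requiring wait_ms >= 0 is an assumption of this sketch.
    if isinstance(wait, bool) or not isinstance(wait, int) or wait < 0:
        errors.append("wait_ms must be a non-negative int")
    return errors


def gate_failures(
    holdout: Dict[str, float],
    baseline_holdout: Dict[str, float],
    max_success_drop: float = 0.05,  # hypothetical tolerance
) -> List[str]:
    """Return the metric gates a candidate fails on holdout."""
    failures = []
    if holdout.get("runs_successfully") != 1.0:
        failures.append("runs_successfully != 1.0")
    if holdout.get("dangerous_non_idempotent_retries") != 0.0:
        failures.append("dangerous_non_idempotent_retries != 0.0")
    floor = baseline_holdout["success_rate"] - max_success_drop
    if holdout.get("success_rate", 0.0) < floor:
        failures.append("holdout success_rate materially worse than baseline")
    return failures
```

Human diff review stays outside this function on purpose: an empty failure list means a candidate is eligible for review, not that it is accepted.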
Documented in docs/MODEL_FAILURE_MODES.md. Highlights:
- contract drift (changed signature)
- invalid payload keys (wait_time instead of wait_ms)
- undefined variables (last_request_latency, error_code)
- prose instead of code
- no-op mutations
- excessive rewrite outside the intended block
- reward hacking (improving latency by failing fast)
- repeated invalid candidates from very small models
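Two of the failure modes above, contract drift and prose instead of code, can be caught statically before a candidate is ever executed. The sketch below compares parameter names using Python's `ast` module. The helper names are hypothetical, and the check ignores defaults, annotations, `*args`, and `**kwargs`, so it is a first filter rather than a full contract test.

```python
import ast
from typing import List, Optional


def function_params(source: str, name: str = "choose_action") -> Optional[List[str]]:
    """Return the parameter names of a top-level function, or None if absent."""
    tree = ast.parse(source)  # raises SyntaxError on non-Python text
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == name:
            args = node.args
            return [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
    return None


def contract_drift(baseline_src: str, candidate_src: str) -> Optional[str]:
    """Describe the first contract problem found, or return None if clean."""
    try:
        candidate = function_params(candidate_src)
    except SyntaxError:
        return "candidate is not valid Python (possibly prose instead of code)"
    if candidate is None:
        return "choose_action is missing"
    if candidate != function_params(baseline_src):
        return "choose_action signature changed"
    return None
```

Because nothing is imported or run, this check is safe to apply to arbitrary model output.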
What this project is not:
- not a production retry library
- not a benchmark proving one model is best
- not a claim that LLMs can autonomously ship reliability logic
- not a replacement for domain-specific design review
- not a complete distributed-systems simulator
Known limitations:
- the synthetic scenario set is small and stylized
- the evaluator does not model partial failures, jitter, or queue effects
- there is no adversarial scenario generator yet
- combined score weights are hand-tuned and reward-hackable
- holdout uses a single seed (314); deeper holdouts are future work
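As a rough illustration of what a deeper holdout could look like, the sketch below aggregates one metric across several seeds. It is not part of the repository: `summarize_across_seeds` and the `evaluate_seed` callback are hypothetical, and the callback is assumed to return a metrics dict like the ones shown earlier for a given seed.

```python
import statistics
from typing import Callable, Dict, Iterable


def summarize_across_seeds(
    evaluate_seed: Callable[[int], Dict[str, float]],
    seeds: Iterable[int],
    metric: str = "success_rate",
) -> Dict[str, float]:
    """Summarize one metric over several holdout seeds."""
    values = [evaluate_seed(seed)[metric] for seed in seeds]
    return {
        "mean": statistics.fmean(values),
        "worst": min(values),                 # the pessimistic view for gating
        "spread": max(values) - min(values),  # large spread hints at seed luck
    }
```

Gating on the worst seed rather than the mean would make it harder for a candidate to pass because of a single favorable draw.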
Future work:
- richer scenario generation (queueing effects, partial failures, jitter)
- better endpoint-switch / failover modeling
- multi-seed and adversarial holdouts
- mutation-testing-style stress scenarios
- automated per-run report generation
- model comparison matrix across local + hosted backends
- extension to fintech protocol / FIX message lifecycle validation
- extension to document-processing QA using mdify-style pipelines
- LLM output regression testing harness
- EXPERIMENTS.md — experiment journal
- Experiment artifacts — curated index of committed OpenEvolve run outputs.
- ACCEPTANCE_CRITERIA.md — hard, behavioral, and holdout gates
- LESSONS_LEARNED.md — distilled lessons
- docs/POSITIONING.md — how this maps to QA / reliability work
- docs/MODEL_FAILURE_MODES.md — observed LLM failure modes
- docs/REPORT_TEMPLATE.md — per-run report template
- docs/REPO_GUIDE.md — navigation guide
- docs/reports/FINAL_REPORT.md — polished case-study summary
- docs/reports/EXPERIMENT_LOG.md — chronological log
- docs/reports/HOLDOUT_ANALYSIS.md — holdout deep dive
A planned set of public write-ups based on this repository is outlined in docs/ARTICLE_SERIES_PLAN.md. Topics include evaluation-first AI engineering, train-vs-holdout traps, and applying these patterns to fintech protocol testing and AI-assisted QA tooling.
This project is licensed under the MIT License. See LICENSE.