A general-purpose, evaluation-driven, closed-loop framework for search and improvement
Implement 5 adapter interfaces. The framework handles the rest.
Quick Start • Prompt Optimization • Architecture • Components • Testing • Chinese Documentation
The framework orchestrates the complete generate → evaluate → select → remember → regenerate evolution loop. It handles parallel scheduling, secure sandboxing, lineage tracking, experience retrieval, and search strategies out of the box. You supply the domain logic through 5 minimal adapter interfaces.
External Input → TaskAdapter → Population Init → Search & Select Parents
↓
Experience Deposit ← Prune ← Update ← Fitness Scoring
↑ ↑
RAG Retrieval → Mutation → PreCheck → Sandbox Eval → Parse Results
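As a rough illustration of the loop in the diagram above, here is a minimal pure-Python sketch. It does not use the framework's API: `evolve`, `flip_one_bit`, and the bit-string candidate type are hypothetical names chosen for this example, and the "memory" is just a list rather than a vector store.

```python
import random
from typing import Callable, List, Tuple

Candidate = List[int]  # hypothetical candidate type: a bit string


def evolve(
    seed: Candidate,
    mutate: Callable[[Candidate, random.Random], Candidate],
    evaluate: Callable[[Candidate], float],
    rounds: int,
    population_cap: int,
    rng: random.Random,
) -> Tuple[Candidate, float, List[Tuple[Candidate, float]]]:
    """Toy version of generate -> evaluate -> select -> remember -> regenerate."""
    population = [(seed, evaluate(seed))]
    memory: List[Tuple[Candidate, float]] = []  # every evaluated variant is recorded
    for _round in range(rounds):
        # Select: use the current best candidate as the parent.
        parent, _parent_score = max(population, key=lambda pair: pair[1])
        # Generate: derive a child from the parent.
        child = mutate(parent, rng)
        # Evaluate: score the child.
        score = evaluate(child)
        # Remember: keep the outcome whether it was good or bad.
        memory.append((child, score))
        # Update and prune: keep only the top `population_cap` candidates.
        population.append((child, score))
        population.sort(key=lambda pair: pair[1], reverse=True)
        del population[population_cap:]
    best, best_score = population[0]
    return best, best_score, memory


def flip_one_bit(bits: Candidate, rng: random.Random) -> Candidate:
    """Hypothetical mutation: flip one randomly chosen bit."""
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]


best, best_score, memory = evolve(
    seed=[0] * 8,
    mutate=flip_one_bit,
    evaluate=lambda bits: float(sum(bits)),  # fitness = number of ones
    rounds=200,
    population_cap=4,
    rng=random.Random(0),
)
print(best, best_score, len(memory))
```

In the framework itself, the mutate and evaluate steps would correspond to the agentic mutation and sandbox evaluation stages, with domain logic supplied through the adapters.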
| Scenario | Candidate Variant | Sandbox | Fitness |
|---|---|---|---|
| Automated Program Repair | Code patches | Run tests | Pass rate |
| Configuration Optimization | Parameter combos | Run benchmarks | Performance metrics |
| 🔥 Prompt Optimization | Prompt variants | Call LLM | Output quality |
| Model Fine-tuning | Hyperparams / data mix | Run training | Validation metrics |
git clone <repo-url> && cd self_evolving
python3.12 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
# Run all tests (704 cases + 63 property-based tests)
pytest tests/ -q
# Watch the evolution process in action
pytest tests/test_e2e.py::TestNumberOptimisationVerbose -v -s

======================================================================
Number Optimization Evolution Demo
Objective: f(x) = -(x-5)² + 25 Optimum: x=5.0, f(5)=25.0
======================================================================
Round Pop Evals Best f(x)
─────────────────────────────────
1 1 1 f=15.11
3 1 1 f=23.43
6 1 1 f=24.99
10 1 1 f=25.00 ✓
Best x: 4.9839 → f(x) = 24.9997
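The demo output above can be approximated outside the framework with a simple stochastic hill climb on the same objective. This is only a sketch under stated assumptions: `hill_climb`, its step size, and its seed are hypothetical and are not what the test case actually runs, so the numbers it prints will differ from the log above.

```python
import random


def f(x: float) -> float:
    """Objective from the demo: maximum of 25.0 at x = 5.0."""
    return -(x - 5.0) ** 2 + 25.0


def hill_climb(x0: float, rounds: int, step: float, seed: int) -> float:
    """Keep a single candidate and accept a mutation only if it scores higher."""
    rng = random.Random(seed)
    best_x = x0
    for _ in range(rounds):
        candidate = best_x + rng.gauss(0.0, step)  # mutate
        if f(candidate) > f(best_x):               # evaluate and select
            best_x = candidate
    return best_x


best_x = hill_climb(x0=0.0, rounds=500, step=0.5, seed=42)
print(f"Best x: {best_x:.4f} -> f(x) = {f(best_x):.4f}")
```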
from self_evolving.adapters import *
from self_evolving.orchestrator.registry import InterfaceRegistry
from self_evolving.orchestrator.evolution_loop import EvolutionLoop
class MyTaskAdapter(TaskAdapter):
    def parse(self, external_input: dict) -> TaskSpec: ...

class MyCandidateApplier(CandidateApplier):
    def apply(self, task, base_workspace, candidate) -> str: ...

class MyResultParser(ResultParser):
    def parse(self, task, run_artifacts) -> EvalResult: ...

class MyFitnessFunction(FitnessFunction):
    def score(self, task, result, candidate) -> Fitness: ...

class MyFitnessComparator(FitnessComparator):
    def better(self, a: Fitness, b: Fitness) -> bool: ...

# Register, configure, and run
registry = InterfaceRegistry()
registry.register_task_adapter(MyTaskAdapter())
# ... register all 5 adapters
loop = EvolutionLoop(registry=registry, config=config, ...)
result = loop.run({"repo": "my-project", "issue": "bug-123"})

Prompt optimization is the flagship application built on the framework: a production-grade platform that integrates real LLM APIs (GLM / OpenAI) and uses evolutionary algorithms to search for better prompts automatically.
┌─────────────────────────────────────────────────────────────────┐
│ OptimizationEngine │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────┐ │
│ │ LLM Client │ │ Eval │ │ Mutation │ │Overfitting│ │
│ │ GLM/OpenAI │ │ Pipeline │ │ Strategies │ │ Detection │ │
│ │ Rate Limit │ │ NLP+Judge │ │ Meta+Struct │ │ Separation│ │
│ │ Token Budg │ │ Tiered │ │ Annealing │ │ Gaming │ │
│ └────────────┘ └────────────┘ └────────────┘ └──────────┘ │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────┐ │
│ │Persistence │ │ A/B Test │ │ Progress │ │ Eval │ │
│ │ Checkpoint │ │ Traffic │ │ Monitor │ │ Cache │ │
│ │ Recovery │ │ Routing │ │ Callbacks │ │ TTL+Cap │ │
│ └────────────┘ └────────────┘ └────────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────┘
pip install zhipuai
python examples/run_with_glm.py # Edit line 18 with your API key

from prompt_optimization.config import *
from prompt_optimization.engine import OptimizationEngine
from prompt_optimization.evaluation.dataset import EvaluationDataset
from prompt_optimization.evaluation.models import DatasetItem
from prompt_optimization.llm.client import LLMClient, LLMResponse
# 1️⃣ Connect to GLM API
from zhipuai import ZhipuAI
_zhipu = ZhipuAI(api_key="your-api-key")

def glm_request(model, prompt, temperature, max_tokens):
    # NOTE: the `model` argument is ignored; this example always calls glm-5.
    resp = _zhipu.chat.completions.create(
        model="glm-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=max(0.01, temperature),  # clamp temperature to at least 0.01
        max_tokens=max_tokens,
    )
    # Wrap the raw API response in the framework's LLMResponse type.
    return LLMResponse(
        text=resp.choices[0].message.content,
        prompt_tokens=resp.usage.prompt_tokens,
        completion_tokens=resp.usage.completion_tokens,
        model="glm-5",
    )

cfg = LLMClientConfig(backend="glm", api_key="your-key", max_retries=3)
client = LLMClient(config=cfg, request_fn=glm_request)
# 2️⃣ Prepare evaluation dataset
dataset = EvaluationDataset(items=[
    DatasetItem(
        input_text="Build an alert management module with multi-level alerts",
        reference_output="Alert page with severity filters, aggregation view...",
    ),
    DatasetItem(
        input_text="Create a server resource monitoring page with real-time CPU/memory",
        reference_output="Monitoring page with metric cards, time-series charts...",
    ),
    # ... more evaluation pairs
], split_ratios=(0.6, 0.2, 0.2), seed=42)
# 3️⃣ Configure the optimization engine
config = PromptOptimizationConfig(
    llm_configs={"glm": cfg},
    mutation_model="glm",
    judge_model="glm",
    evaluation_weights={
        "accuracy": 1.5,                 # Feature completeness
        "fluency": 0.8,                  # Code readability
        "terminology_consistency": 1.2,  # Domain terminology
        "format_compliance": 1.0,        # Output format
        "safety": 0.5,                   # Safety
    },
    max_iterations=30,
    population_size=5,
    token_budget=2_000_000,
    annealing=AnnealingConfig(
        initial_temp=1.0, final_temp=0.1, schedule="cosine"
    ),
)
# 4️⃣ Launch optimization
engine = OptimizationEngine(config=config, llm_clients={"glm": client})
result = engine.run(
    seed_prompt_text="You are a senior full-stack engineer. Generate frontend code.",
    dataset=dataset,
)
print(f"Best prompt: {result['best_prompt']}")
print(f"Best score: {result['best_fitness']:.4f}")
print(f"Iterations: {result['iterations']}")
print(f"Token usage: {result['token_usage']['total_tokens']}")

[Engine] 🌱 Seed score: 0.4870
[Evolution] ── Iteration 1 ── pop=1
[LLM] Calling GLM-5 for mutation...
[Pipeline] Evaluating 7 items → overall=0.5234 ✓
[Evolution] ── Iteration 2 ── pop=2
[LLM] Meta-prompting mutation...
[Pipeline] Evaluating → overall=0.6891 ✓ New best!
...
[Evolution] ── Iteration 30 ── best=0.9234
✅ Optimization complete!
Best prompt: You are a senior frontend architect with 10+ years...
Score lift: 0.4870 → 0.9234 (+89.6%)
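The configuration above requests a cosine annealing schedule from `initial_temp=1.0` to `final_temp=0.1`. How the engine computes it is not shown here, so the following sketch assumes the standard cosine annealing formula; `cosine_temperature` is a hypothetical helper, not part of `prompt_optimization`.

```python
import math


def cosine_temperature(
    step: int,
    total_steps: int,
    initial_temp: float = 1.0,
    final_temp: float = 0.1,
) -> float:
    """Standard cosine annealing (assumed): starts at initial_temp, ends at final_temp."""
    if total_steps <= 1:
        return final_temp
    # Fraction of the run completed, clamped to [0, 1].
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return final_temp + 0.5 * (initial_temp - final_temp) * (1.0 + math.cos(math.pi * progress))


# One temperature per iteration for a 30-iteration run.
temps = [cosine_temperature(i, 30) for i in range(30)]
print(f"first={temps[0]:.3f} middle={temps[15]:.3f} last={temps[-1]:.3f}")
```

A high temperature early on would typically allow larger, more exploratory mutations, while the lower temperature near the end favors small refinements.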
Every component works independently — no need to run the full evolution loop:
# NLP Scoring (pure Python, no external deps for BLEU/ROUGE-L)
from prompt_optimization.evaluation.nlp_scorer import NLPScorer
scorer = NLPScorer()
result = scorer.score("The weather is nice", "The weather is nice")
# → BLEU-1=1.0, ROUGE-L=1.0
# LLM-as-Judge (5-dimension scoring)
from prompt_optimization.evaluation.llm_judge import LLMJudge
judge = LLMJudge(llm_client=client, model="glm-5")
scores = judge.judge(generated, reference, task_description)
# → {accuracy: 0.9, fluency: 0.85, terminology_consistency: 0.78, ...}
# Overfitting Detection
from prompt_optimization.overfitting.detector import OverfittingDetector
detector = OverfittingDetector(config=overfitting_config)
detector.check_dataset_separation(dev_score=0.9, val_score=0.7) # → True
detector.detect_gaming("prompt with ground_truth leaked") # → True
# A/B Testing with Statistical Significance
from prompt_optimization.online.ab_test import ABTestManager
ab = ABTestManager()
exp_id = ab.create_experiment("old_prompt", "new_prompt", traffic_ratio=0.3)
# ... collect metrics ...
result = ab.evaluate_experiment(exp_id)
# → lift=+15.2%, p=0.003, recommendation="promote"
# Structured Prompt Serialization
from prompt_optimization.prompt.structured import StructuredPrompt, PromptExample
sp = StructuredPrompt(
    role="Expert translator",
    task_description="Translate English to Chinese",
    examples=[PromptExample(input_text="Hello", output_text="你好")],
    cot_instruction="First understand the meaning, then choose the right expression",
)
json_str = sp.to_json() # Serialize
restored = StructuredPrompt.from_json(json_str) # Deserialize (round-trip safe)
print(sp.render_formatted()) # Human-readable output

self_evolving/ ← Core evolution framework
├── models/ ← Immutable data models (dataclasses)
├── adapters/ ← 5 domain adapter interface ABCs
├── agentic/ ← Agent engine (Critique/Mutation/RAG/SelfCheck)
├── memory/ ← Evolution memory (lineage DAG + vector experience store)
├── search/ ← Search control (GA/MCTS/population/pruning)
├── sandbox/ ← Secure sandbox (container isolation/cache/retry)
└── orchestrator/ ← Closed-loop orchestrator (evolution loop/parallel/overfitting)
prompt_optimization/ ← Enterprise prompt optimization (flagship app)
├── llm/ ← LLM clients (GLM/OpenAI, rate limiting, token budget)
├── evaluation/ ← Eval system (NLP metrics, LLM Judge, pipeline, tiered, cache)
├── mutation/ ← Mutation strategies (meta-prompting, structured, annealing)
├── overfitting/ ← Overfitting protection (separation, penalty, gaming detection)
├── persistence/ ← Persistence (VectorMemory disk, state serialization, checkpoints)
├── online/ ← Online closed-loop (A/B testing, traffic routing, significance)
├── monitoring/ ← Real-time monitoring (iteration reports, callbacks, dual-channel)
├── prompt/ ← Structured prompts (serialization, rendering)
├── adapters.py ← Framework adapter implementations
├── config.py ← Configuration dataclasses
└── engine.py ← OptimizationEngine top-level orchestrator
| Component | Built-in Implementations | Replaceable |
|---|---|---|
| SearchAlgo | GeneticAlgorithm, MCTSSearch | ✅ |
| Pruner | FitnessPruner, AgePruner, DuplicatePruner | ✅ |
| FitnessAggregator | WeightedSum, Lexicographic, Pareto | ✅ |
| Critique | LLMCritique (with rule fallback) | ✅ |
| Mutation | LLMMutation, MetaPrompter, StructuredMutation | ✅ |
| SelfCheck | DefaultSelfCheck (syntax + format) | ✅ |
| VectorBackend | InMemoryVectorBackend | ✅ (Faiss, Milvus) |
| LLMClient | GLM, OpenAI (unified interface) | ✅ |
| EvalPipeline | NLPScorer + LLMJudge + weighted aggregation | ✅ |
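The table lists weighted aggregation as part of the evaluation pipeline. One plausible reading is a weighted mean over the scoring dimensions, sketched below. `weighted_score` is a hypothetical helper, the weights are the ones from the configuration example, and the `format_compliance` and `safety` scores are made-up illustrative values.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean over the dimensions present in both mappings (assumed formula)."""
    shared = [name for name in weights if name in scores]
    total_weight = sum(weights[name] for name in shared)
    if total_weight <= 0:
        raise ValueError("no overlapping dimensions with positive total weight")
    return sum(weights[name] * scores[name] for name in shared) / total_weight


weights = {
    "accuracy": 1.5,
    "fluency": 0.8,
    "terminology_consistency": 1.2,
    "format_compliance": 1.0,
    "safety": 0.5,
}
scores = {
    "accuracy": 0.9,
    "fluency": 0.85,
    "terminology_consistency": 0.78,
    "format_compliance": 1.0,  # illustrative value
    "safety": 1.0,             # illustrative value
}
overall = weighted_score(scores, weights)
print(f"overall={overall:.4f}")
```

Dividing by the total weight keeps the result in the same range as the individual scores, so changing the weights shifts emphasis without rescaling the fitness.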
- Vectorized Fitness — Native multi-objective optimization with built-in Pareto front sorting
- Memory as First-Class Citizen — Failure/success experiences auto-deposited to vector store, RAG-assisted mutation
- Full Lineage Traceability — Every candidate's evolution path is traceable; pruned candidates are marked, not deleted
- Secure by Default — Sandbox defaults to network isolation, read-only filesystem, seccomp, rootless
- Built-in Overfitting Governance — Dataset separation detection + length penalty + gaming pattern detection + score spike detection
- Triple Cost Control — Token budget hard cap + tiered evaluation + evaluation cache
- Resumable — State serialization + automatic checkpoints + integrity verification
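To make the multi-objective point in the list above concrete, here is a minimal sketch of Pareto dominance and non-dominated filtering for fitness vectors where higher is better on every objective. `dominates` and `pareto_front` are hypothetical names for this example and are not the framework's `Pareto` aggregator.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Each tuple is a fitness vector, e.g. (quality, cost efficiency).
candidates = [(0.9, 0.2), (0.7, 0.7), (0.6, 0.6), (0.2, 0.9), (0.9, 0.1)]
front = pareto_front(candidates)
print(front)  # → [(0.9, 0.2), (0.7, 0.7), (0.2, 0.9)]
```

This quadratic-time filter is fine for small populations; larger populations would usually call for a dedicated non-dominated sorting routine.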
# All tests
pytest tests/ -q
# Prompt optimization module only (33 property tests + unit + integration + E2E)
pytest tests/prompt_optimization/ -v
# Property-based tests only (63 correctness properties)
pytest tests/ -v -k "property or Property"
# Evolution demo
pytest tests/test_e2e.py::TestNumberOptimisationVerbose -v -s

| Layer | Count | Tool | Coverage |
|---|---|---|---|
| Unit Tests | 600+ | pytest | Component behavior, edge cases, error paths |
| Property Tests | 63 | Hypothesis | 33 prompt optimization + 30 framework properties |
| Integration Tests | 20+ | pytest + mock | Eval pipeline, LLM call chain, tiered evaluation |
| End-to-End Tests | 10+ | pytest | Full evolution loop, checkpoint recovery, A/B testing |
MIT