lizhongxuan/self_evolving

🧬 Self-Evolving Framework

A general-purpose, evaluation-driven framework for closed-loop search and improvement.
Implement 5 adapter interfaces. The framework handles the rest.


🎯 What is this?

Self-Evolving orchestrates the complete generate → evaluate → select → remember → regenerate evolution loop. It handles parallel scheduling, secure sandboxing, lineage tracking, experience retrieval, and search strategies out of the box. You supply the domain logic through 5 minimal adapter interfaces.

External Input → TaskAdapter → Population Init → Search & Select Parents
                                                        ↓
                  Experience Deposit ← Prune ← Update ← Fitness Scoring
                        ↑                                     ↑
                  RAG Retrieval → Mutation → PreCheck → Sandbox Eval → Parse Results

🚀 Use Cases

| Scenario | Candidate Variant | Sandbox | Fitness |
| --- | --- | --- | --- |
| Automated Program Repair | Code patches | Run tests | Pass rate |
| Configuration Optimization | Parameter combos | Run benchmarks | Performance metrics |
| 🔥 Prompt Optimization | Prompt variants | Call LLM | Output quality |
| Model Fine-tuning | Hyperparams / data mix | Run training | Validation metrics |

⚡ Quick Start

git clone <repo-url> && cd self_evolving
python3.12 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Run all tests (704 cases + 63 property-based tests)
pytest tests/ -q

# Watch the evolution process in action
pytest tests/test_e2e.py::TestNumberOptimisationVerbose -v -s

Evolution Demo Output

======================================================================
  Number Optimization Evolution Demo
  Objective: f(x) = -(x-5)² + 25    Optimum: x=5.0, f(5)=25.0
======================================================================

  Round   Pop    Evals   Best f(x)
  ─────────────────────────────────
   1       1      1      f=15.11
   3       1      1      f=23.43
   6       1      1      f=24.99
  10       1      1      f=25.00  ✓

  Best x: 4.9839  →  f(x) = 24.9997
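
The demo objective above is simple enough to reproduce without the framework. The following is a minimal sketch of a generate → evaluate → select → remember loop on f(x) = -(x-5)² + 25. It is not the framework's implementation: the function names, the Gaussian mutation, and the parameters `children` and `sigma` are all assumptions made for illustration.

```python
import random


def fitness(x: float) -> float:
    """Demo objective f(x) = -(x - 5)^2 + 25, maximized at x = 5."""
    return -(x - 5.0) ** 2 + 25.0


def evolve(rounds: int = 200, children: int = 8, sigma: float = 0.5, seed: int = 0):
    """Hypothetical (1 + children) evolution loop; returns (best_x, best_f, history)."""
    rng = random.Random(seed)
    best_x = rng.uniform(0.0, 10.0)      # population init: a single random parent
    best_f = fitness(best_x)
    history = [(best_x, best_f)]         # remember: every accepted parent, in order

    for _ in range(rounds):
        # generate: mutate the current parent with Gaussian noise
        candidates = [best_x + rng.gauss(0.0, sigma) for _ in range(children)]
        # evaluate: score every candidate
        scored = [(fitness(x), x) for x in candidates]
        # select: the best child replaces the parent only if it is strictly better
        top_f, top_x = max(scored)
        if top_f > best_f:
            best_x, best_f = top_x, top_f
            history.append((best_x, best_f))

    return best_x, best_f, history
```

Calling `evolve()` returns the best candidate found, its fitness, and the list of accepted parents, which plays the role of a very small lineage record.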

Implement 5 Adapters & Run

from self_evolving.adapters import *
from self_evolving.orchestrator.registry import InterfaceRegistry
from self_evolving.orchestrator.evolution_loop import EvolutionLoop

class MyTaskAdapter(TaskAdapter):
    def parse(self, external_input: dict) -> TaskSpec: ...

class MyCandidateApplier(CandidateApplier):
    def apply(self, task, base_workspace, candidate) -> str: ...

class MyResultParser(ResultParser):
    def parse(self, task, run_artifacts) -> EvalResult: ...

class MyFitnessFunction(FitnessFunction):
    def score(self, task, result, candidate) -> Fitness: ...

class MyFitnessComparator(FitnessComparator):
    def better(self, a: Fitness, b: Fitness) -> bool: ...

# Register, configure, and run
registry = InterfaceRegistry()
registry.register_task_adapter(MyTaskAdapter())
# ... register all 5 adapters

loop = EvolutionLoop(registry=registry, config=config, ...)
result = loop.run({"repo": "my-project", "issue": "bug-123"})
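
To make the five roles concrete, here is a toy sketch for the number-optimization objective. It is deliberately framework-free: `TaskSpec`, `EvalResult`, and `Fitness` below are local stand-ins, not the real models, and their fields, the `"target"` input key, and the `"stdout"` artifact key are assumptions. The classes only mirror the method names and parameters shown in the skeleton above; in real use each one would subclass the corresponding ABC from `self_evolving.adapters`.

```python
from dataclasses import dataclass


# Stand-in data models (hypothetical; the framework ships its own).
@dataclass(frozen=True)
class TaskSpec:
    target: float


@dataclass(frozen=True)
class EvalResult:
    output: float


@dataclass(frozen=True)
class Fitness:
    values: tuple[float, ...]


class NumberTaskAdapter:
    def parse(self, external_input: dict) -> TaskSpec:
        # Turn raw external input into a task description.
        return TaskSpec(target=float(external_input["target"]))


class NumberCandidateApplier:
    def apply(self, task, base_workspace, candidate) -> str:
        # Assumed meaning of the return value: a workspace identifier for the sandbox.
        return f"{base_workspace}/x={candidate}"


class NumberResultParser:
    def parse(self, task, run_artifacts) -> EvalResult:
        # Assumes the sandbox run echoes the evaluated number on stdout.
        return EvalResult(output=float(run_artifacts["stdout"]))


class NumberFitnessFunction:
    def score(self, task, result, candidate) -> Fitness:
        # Single-objective fitness stored as a 1-tuple so it stays vector-shaped.
        return Fitness(values=(-(result.output - task.target) ** 2 + 25.0,))


class NumberFitnessComparator:
    def better(self, a: Fitness, b: Fitness) -> bool:
        # Tuple comparison: higher is better.
        return a.values > b.values
```

Chaining the five objects by hand (parse the input, apply a candidate, parse the run artifacts, score, compare) follows the same order as the loop diagram earlier in this document.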

🔥 Enterprise Prompt Optimization

Prompt optimization is the flagship application built on top of the framework: a production-grade platform that integrates real LLM APIs (GLM / OpenAI) and uses evolutionary algorithms to search automatically for higher-scoring prompts.

Core Capabilities

┌──────────────────────────────────────────────────────────────────────────┐
│                            OptimizationEngine                            │
│                                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ LLM Client   │  │ Eval         │  │ Mutation     │  │ Overfitting  │  │
│  │ GLM/OpenAI   │  │ Pipeline     │  │ Strategies   │  │ Detection    │  │
│  │ Rate Limit   │  │ NLP+Judge    │  │ Meta+Struct  │  │ Separation   │  │
│  │ Token Budget │  │ Tiered       │  │ Annealing    │  │ Gaming       │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │
│                                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Persistence  │  │ A/B Test     │  │ Progress     │  │ Eval         │  │
│  │ Checkpoint   │  │ Traffic      │  │ Monitor      │  │ Cache        │  │
│  │ Recovery     │  │ Routing      │  │ Callbacks    │  │ TTL+Cap      │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
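
The "Eval Cache / TTL+Cap" box suggests a cache bounded by both entry age and entry count. The sketch below shows one common way to build such a cache with the standard library. It is an illustration only: the class name `TTLCache`, its parameters, and the least-recently-used eviction policy are assumptions, not the module's actual API.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Hypothetical cache with a time-to-live and a capacity cap (LRU eviction)."""

    def __init__(self, capacity: int, ttl_seconds: float, clock=time.monotonic):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._ttl = ttl_seconds
        self._clock = clock                  # injectable so tests can control time
        self._entries = OrderedDict()        # key -> (value, expires_at)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:      # expired: drop it and report a miss
            del self._entries[key]
            return default
        self._entries.move_to_end(key)       # mark as most recently used
        return value

    def put(self, key, value):
        self._entries.pop(key, None)         # re-inserting refreshes position and TTL
        self._entries[key] = (value, self._clock() + self._ttl)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

Keying such a cache on the pair of prompt text and dataset item would let repeated evaluations of an unchanged prompt skip the LLM call, which is the kind of saving the cost-control features aim for.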

30-Second Quickstart

pip install zhipuai
python examples/run_with_glm.py   # Edit line 18 with your API key

Full Example: Optimizing Prompts for an AIOps Platform

from prompt_optimization.config import *
from prompt_optimization.engine import OptimizationEngine
from prompt_optimization.evaluation.dataset import EvaluationDataset
from prompt_optimization.evaluation.models import DatasetItem
from prompt_optimization.llm.client import LLMClient, LLMResponse

# 1️⃣  Connect to GLM API
from zhipuai import ZhipuAI
_zhipu = ZhipuAI(api_key="your-api-key")

def glm_request(model, prompt, temperature, max_tokens):
    # The `model` argument is not used here; every call is pinned to "glm-5".
    resp = _zhipu.chat.completions.create(
        model="glm-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=max(0.01, temperature),  # clamp to at least 0.01
        max_tokens=max_tokens,
    )
    return LLMResponse(
        text=resp.choices[0].message.content,
        prompt_tokens=resp.usage.prompt_tokens,
        completion_tokens=resp.usage.completion_tokens,
        model="glm-5",
    )

cfg = LLMClientConfig(backend="glm", api_key="your-api-key", max_retries=3)
client = LLMClient(config=cfg, request_fn=glm_request)

# 2️⃣  Prepare evaluation dataset
dataset = EvaluationDataset(items=[
    DatasetItem(
        input_text="Build an alert management module with multi-level alerts",
        reference_output="Alert page with severity filters, aggregation view...",
    ),
    DatasetItem(
        input_text="Create a server resource monitoring page with real-time CPU/memory",
        reference_output="Monitoring page with metric cards, time-series charts...",
    ),
    # ... more evaluation pairs
], split_ratios=(0.6, 0.2, 0.2), seed=42)

# 3️⃣  Configure the optimization engine
config = PromptOptimizationConfig(
    llm_configs={"glm": cfg},
    mutation_model="glm",
    judge_model="glm",
    evaluation_weights={
        "accuracy": 1.5,                  # Feature completeness
        "fluency": 0.8,                   # Code readability
        "terminology_consistency": 1.2,    # Domain terminology
        "format_compliance": 1.0,          # Output format
        "safety": 0.5,                     # Safety
    },
    max_iterations=30,
    population_size=5,
    token_budget=2_000_000,
    annealing=AnnealingConfig(
        initial_temp=1.0, final_temp=0.1, schedule="cosine"
    ),
)

# 4️⃣  Launch optimization
engine = OptimizationEngine(config=config, llm_clients={"glm": client})
result = engine.run(
    seed_prompt_text="You are a senior full-stack engineer. Generate frontend code.",
    dataset=dataset,
)

print(f"Best prompt:  {result['best_prompt']}")
print(f"Best score:   {result['best_fitness']:.4f}")
print(f"Iterations:   {result['iterations']}")
print(f"Token usage:  {result['token_usage']['total_tokens']}")
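
The `AnnealingConfig` in the example moves a temperature from `initial_temp=1.0` to `final_temp=0.1` on a `"cosine"` schedule. A minimal sketch of the standard cosine annealing formula follows. Whether the engine uses exactly this form, and how it maps iterations to steps, is an assumption; `cosine_temperature` is a hypothetical helper, not part of the package.

```python
import math


def cosine_temperature(step: int, total_steps: int,
                       initial_temp: float = 1.0, final_temp: float = 0.1) -> float:
    """Cosine-annealed temperature for a 0-based step in [0, total_steps - 1]."""
    if total_steps <= 1:
        return final_temp
    # Clamp the step, then map it to progress in [0, 1].
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    # Half a cosine wave: starts at initial_temp, ends at final_temp.
    return final_temp + 0.5 * (initial_temp - final_temp) * (1.0 + math.cos(math.pi * progress))
```

With 30 iterations the temperature starts at 1.0, falls slowly at first, fastest around the midpoint, and flattens out as it approaches 0.1, so early mutations are exploratory and late ones conservative.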

Evolution Progress

[Engine] 🌱 Seed score: 0.4870
[Evolution] ── Iteration 1 ── pop=1
  [LLM] Calling GLM-5 for mutation...
  [Pipeline] Evaluating 7 items → overall=0.5234 ✓
[Evolution] ── Iteration 2 ── pop=2
  [LLM] Meta-prompting mutation...
  [Pipeline] Evaluating → overall=0.6891 ✓ New best!
  ...
[Evolution] ── Iteration 30 ── best=0.9234

✅ Optimization complete!
  Best prompt: You are a senior frontend architect with 10+ years...
  Score lift:  0.4870 → 0.9234 (+89.6%)

Standalone Component Usage

Every component works independently — no need to run the full evolution loop:

# NLP Scoring (pure Python, no external deps for BLEU/ROUGE-L)
from prompt_optimization.evaluation.nlp_scorer import NLPScorer
scorer = NLPScorer()
result = scorer.score("The weather is nice", "The weather is nice")
# → BLEU-1=1.0, ROUGE-L=1.0

# LLM-as-Judge (5-dimension scoring)
# `generated`, `reference`, and `task_description` are values you supply.
from prompt_optimization.evaluation.llm_judge import LLMJudge
judge = LLMJudge(llm_client=client, model="glm-5")
scores = judge.judge(generated, reference, task_description)
# → {accuracy: 0.9, fluency: 0.85, terminology_consistency: 0.78, ...}

# Overfitting Detection
# `overfitting_config` is a configuration object you construct beforehand.
from prompt_optimization.overfitting.detector import OverfittingDetector
detector = OverfittingDetector(config=overfitting_config)
detector.check_dataset_separation(dev_score=0.9, val_score=0.7)  # → True
detector.detect_gaming("prompt with ground_truth leaked")         # → True

# A/B Testing with Statistical Significance
from prompt_optimization.online.ab_test import ABTestManager
ab = ABTestManager()
exp_id = ab.create_experiment("old_prompt", "new_prompt", traffic_ratio=0.3)
# ... collect metrics ...
result = ab.evaluate_experiment(exp_id)
# → lift=+15.2%, p=0.003, recommendation="promote"

# Structured Prompt Serialization
from prompt_optimization.prompt.structured import StructuredPrompt, PromptExample
sp = StructuredPrompt(
    role="Expert translator",
    task_description="Translate English to Chinese",
    examples=[PromptExample(input_text="Hello", output_text="你好")],
    cot_instruction="First understand the meaning, then choose the right expression",
)
json_str = sp.to_json()                        # Serialize
restored = StructuredPrompt.from_json(json_str) # Deserialize (round-trip safe)
print(sp.render_formatted())                    # Human-readable output
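
The NLP scoring example above reports BLEU-1 and ROUGE-L computed in pure Python. As a rough illustration of what those two metrics measure, here is a dependency-free sketch using whitespace tokenization. The tokenizer, the brevity penalty, and the use of the F1 form of ROUGE-L are assumptions; `NLPScorer` may differ in any of them.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision times the BLEU brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate token counts at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return brevity * precision


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For identical inputs such as "The weather is nice" both functions return 1.0, matching the values shown in the scoring example.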

🏗 Architecture

self_evolving/                    ← Core evolution framework
├── models/                       ← Immutable data models (dataclasses)
├── adapters/                     ← 5 domain adapter interface ABCs
├── agentic/                      ← Agent engine (Critique/Mutation/RAG/SelfCheck)
├── memory/                       ← Evolution memory (lineage DAG + vector experience store)
├── search/                       ← Search control (GA/MCTS/population/pruning)
├── sandbox/                      ← Secure sandbox (container isolation/cache/retry)
└── orchestrator/                 ← Closed-loop orchestrator (evolution loop/parallel/overfitting)

prompt_optimization/              ← Enterprise prompt optimization (flagship app)
├── llm/                          ← LLM clients (GLM/OpenAI, rate limiting, token budget)
├── evaluation/                   ← Eval system (NLP metrics, LLM Judge, pipeline, tiered, cache)
├── mutation/                     ← Mutation strategies (meta-prompting, structured, annealing)
├── overfitting/                  ← Overfitting protection (separation, penalty, gaming detection)
├── persistence/                  ← Persistence (VectorMemory disk, state serialization, checkpoints)
├── online/                       ← Online closed-loop (A/B testing, traffic routing, significance)
├── monitoring/                   ← Real-time monitoring (iteration reports, callbacks, dual-channel)
├── prompt/                       ← Structured prompts (serialization, rendering)
├── adapters.py                   ← Framework adapter implementations
├── config.py                     ← Configuration dataclasses
└── engine.py                     ← OptimizationEngine top-level orchestrator

🔌 Pluggable Components

| Component | Built-in Implementations | Replaceable |
| --- | --- | --- |
| SearchAlgo | GeneticAlgorithm, MCTSSearch | |
| Pruner | FitnessPruner, AgePruner, DuplicatePruner | |
| FitnessAggregator | WeightedSum, Lexicographic, Pareto | |
| Critique | LLMCritique (with rule fallback) | |
| Mutation | LLMMutation, MetaPrompter, StructuredMutation | |
| SelfCheck | DefaultSelfCheck (syntax + format) | |
| VectorBackend | InMemoryVectorBackend | ✅ (Faiss, Milvus) |
| LLMClient | GLM, OpenAI (unified interface) | |
| EvalPipeline | NLPScorer + LLMJudge + weighted aggregation | |
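
Two of the aggregation styles named in the table, weighted sum and lexicographic, are easy to illustrate. The sketch below reuses the `evaluation_weights` values from the prompt optimization example. It is an assumption that the built-in aggregators normalize by the total weight or compare in this way; `weighted_score` and `lexicographic_key` are hypothetical helpers.

```python
EVALUATION_WEIGHTS = {
    "accuracy": 1.5,
    "fluency": 0.8,
    "terminology_consistency": 1.2,
    "format_compliance": 1.0,
    "safety": 0.5,
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean of per-dimension scores; stays in [0, 1] if the inputs do."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Raises KeyError if a weighted dimension is missing from the scores.
    return sum(w * scores[name] for name, w in weights.items()) / total_weight


def lexicographic_key(scores: dict, priority: list) -> tuple:
    """Sort key that compares dimensions strictly in priority order."""
    return tuple(scores[name] for name in priority)
```

The weighted mean lets a strong dimension compensate for a weak one, while the lexicographic key never trades a higher-priority dimension for a lower one. Which behavior is preferable depends on whether any dimension (for example safety) should act as a hard priority.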

🧠 Key Design Decisions

  • Vectorized Fitness — Native multi-objective optimization with built-in Pareto front sorting
  • Memory as First-Class Citizen — Failure/success experiences auto-deposited to vector store, RAG-assisted mutation
  • Full Lineage Traceability — Every candidate's evolution path is traceable; pruned candidates are marked, not deleted
  • Secure by Default — Sandbox defaults to network isolation, read-only filesystem, seccomp, rootless
  • Built-in Overfitting Governance — Dataset separation detection + length penalty + gaming pattern detection + score spike detection
  • Triple Cost Control — Token budget hard cap + tiered evaluation + evaluation cache
  • Resumable — State serialization + automatic checkpoints + integrity verification
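
The first design decision mentions Pareto front sorting over vector-valued fitness. As an illustration of the idea, the sketch below computes Pareto dominance and peels a set of fitness vectors into successive fronts. It assumes every objective is maximized and all vectors have the same length; the function names are hypothetical and the framework's own sorting may be implemented differently (this version is quadratic per front).

```python
def dominates(a: tuple, b: tuple) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: list) -> list:
    """Points that no other point dominates, in their original order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def sort_into_fronts(points: list) -> list:
    """Non-dominated sorting: front 0 is the Pareto front, front 1 the next layer, etc."""
    remaining = list(points)
    fronts = []
    while remaining:
        front = pareto_front(remaining)
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts
```

For example, `sort_into_fronts([(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)])` places the first three vectors in the leading front because none of them is beaten on both objectives at once.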

🧪 Testing

# All tests
pytest tests/ -q

# Prompt optimization module only (33 property tests + unit + integration + E2E)
pytest tests/prompt_optimization/ -v

# Property-based tests only (63 correctness properties)
pytest tests/ -v -k "property or Property"

# Evolution demo
pytest tests/test_e2e.py::TestNumberOptimisationVerbose -v -s

Test Matrix

| Layer | Count | Tool | Coverage |
| --- | --- | --- | --- |
| Unit Tests | 600+ | pytest | Component behavior, edge cases, error paths |
| Property Tests | 63 | Hypothesis | 33 prompt optimization + 30 framework properties |
| Integration Tests | 20+ | pytest + mock | Eval pipeline, LLM call chain, tiered evaluation |
| End-to-End Tests | 10+ | pytest | Full evolution loop, checkpoint recovery, A/B testing |

📄 License

MIT
