TEVO

So what: TEVO is a typed evolutionary search loop for language-model architectures with a practical handoff into upstream-style train.py research loops. It discovers motifs under cheap proxy budgets, TrainRecipe carries the renderer-safe subset of those motifs, and downstream autoresearch-style train.py acts as the judge of whether they survive a real benchmark.

Current proof-of-concept: a projected TEVO-discovered frontier sibling improved upstream CUDA autoresearch from val_bpb = 1.117953 to 1.113392 on the pinned upstream commit c12eef778edafc89cd7ce036a7f500ddb5397a65, while also reducing peak VRAM.

What This Repo Is

Architectures are defined in typed YAML (Pydantic). The loop mutates and crosses over these specs, trains each candidate for a short budget, scores it on multiple objectives, and keeps a Pareto frontier. Children inherit weights from their parents, and crossover merges the parents' checkpoints when possible.
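As a rough illustration of what "keeps a Pareto frontier" means, here is a minimal sketch of non-dominated filtering. It is not the repo's orchestrator code: the objective names and values are made up, and every objective is assumed to be "lower is better".

```python
from typing import Dict, List

# Hypothetical objectives; all are treated as "lower is better" in this sketch.
Metrics = Dict[str, float]


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly better on one."""
    keys = a.keys()
    return all(a[k] <= b[k] for k in keys) and any(a[k] < b[k] for k in keys)


def pareto_frontier(candidates: Dict[str, Metrics]) -> List[str]:
    """Return the ids of candidates that no other candidate dominates."""
    return [
        cid
        for cid, m in candidates.items()
        if not any(dominates(o, m) for oid, o in candidates.items() if oid != cid)
    ]


# Made-up population, only to exercise the function.
population = {
    "a": {"val_bpb": 1.20, "peak_vram_gib": 30.0},
    "b": {"val_bpb": 1.15, "peak_vram_gib": 40.0},
    "c": {"val_bpb": 1.25, "peak_vram_gib": 45.0},  # worse than "b" on both axes
}
frontier = pareto_frontier(population)
print(frontier)  # ['a', 'b']
```

Candidates "a" and "b" both survive because each wins on one axis; "c" is dropped because another candidate beats it on both.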

The core split is:

TEVO searches broadly for promising motifs under cheap proxy budgets.
TrainRecipe packages the shared, renderer-safe part of a winning candidate.
autoresearch validates whether that motif still helps inside a real downstream train.py.

The point is not just to search for good small models. The point is to discover motifs cheaply, transfer them cleanly, and validate them honestly downstream.

Current Proof

The repo now supports this search -> transfer -> validation loop end to end:

TEVO discovers a candidate under a short-budget benchmark.
The candidate is exported into a backend-neutral TrainRecipe.
Oversized recipes are projected into the upstream CUDA autoresearch size envelope while preserving the motif.
The projected recipe is rendered into upstream-style train.py.
The resulting train.py is benchmarked on the real 5-minute CUDA autoresearch loop.
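The projection step above can be pictured with a small sketch. Everything here is an assumption for illustration: `RecipeSketch`, its fields, the parameter-count envelope, and the "shrink width first, then depth" policy are hypothetical and are not the real TrainRecipe schema or projection rules (see docs/train_recipe_bridge.md for those). The only borrowed fact is the usual rough estimate of about 12 * d_model^2 parameters per transformer block with a 4x MLP.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecipeSketch:
    # Hypothetical fields, not the real TrainRecipe schema.
    n_layer: int
    d_model: int
    motif: str  # the architectural idea being transferred; never modified here


def approx_params(r: RecipeSketch) -> int:
    """Rough block parameter count: ~12 * d_model^2 per layer (4x MLP assumed)."""
    return 12 * r.n_layer * r.d_model ** 2


def project(r: RecipeSketch, max_params: int, width_step: int = 64) -> RecipeSketch:
    """Shrink width first, then depth, until the estimate fits the envelope."""
    while approx_params(r) > max_params:
        if r.d_model > width_step:
            r = replace(r, d_model=r.d_model - width_step)
        elif r.n_layer > 1:
            r = replace(r, n_layer=r.n_layer - 1)
        else:
            raise ValueError("envelope too small for any scaffold")
    return r


oversized = RecipeSketch(n_layer=12, d_model=1024, motif="example-motif")
projected = project(oversized, max_params=100_000_000)
print(projected)
```

The scaffold dimensions change while the `motif` field is carried through untouched, which is the property the bridge is meant to preserve.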

The first clean proof of that loop is now in place:

Probe	Result
Stock upstream CUDA `autoresearch` baseline	`val_bpb = 1.117953`
TEVO motif-transfer probe	`val_bpb = 1.113392`
Delta	`-0.004561` (lower is better)
Peak VRAM	`43.96 GiB -> 35.89 GiB`

That is the current shareable story: TEVO found a motif, the bridge transferred it honestly, and upstream autoresearch improved.

This is a single projected probe on a pinned upstream commit, so treat it as proof that motif transfer works, not as a broad leaderboard claim yet.
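For readers who want the relative size of the change, the arithmetic on the reported numbers is short enough to check directly. Only the four reported values are used as inputs; the percentages are derived from them.

```python
# Values as reported in the proof table.
baseline_bpb, probe_bpb = 1.117953, 1.113392
baseline_vram, probe_vram = 43.96, 35.89  # GiB

delta_bpb = probe_bpb - baseline_bpb
rel_bpb = delta_bpb / baseline_bpb
delta_vram = probe_vram - baseline_vram
rel_vram = delta_vram / baseline_vram

print(f"val_bpb delta: {delta_bpb:+.6f} ({rel_bpb:+.2%})")          # -0.004561 (-0.41%)
print(f"peak VRAM delta: {delta_vram:+.2f} GiB ({rel_vram:+.1%})")  # -8.07 GiB (-18.4%)
```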

The small proof bundle behind that claim lives in artifacts/motif_transfer_proof.

Start Here

docs/motif_transfer_demo.md for the clearest explanation of the current TEVO -> autoresearch result
artifacts/motif_transfer_proof for the compact artifact bundle behind the current proof
docs/autoresearch_at_home_handoff.md for the one-candidate contribution lane into autoresearch@home
campaigns/campaign-001-d20-nanochat for the first optional collaborative pilot
docs/train_recipe_bridge.md for the bridge design and compatibility rules
docs/cuda_transfer_demo.md for the CUDA workflow on Modal
docs/configuration_guide.md for search configs, objectives, and selection policies

`autoresearch@home` Contribution Lane

So what: TEVO now has a clean way to contribute one chosen candidate into autoresearch@home without turning TEVO itself into a swarm runner.

TEVO stays the search engine.
TrainRecipe stays the shared bridge artifact.
autoresearch@home becomes the collaborative train.py experiment lane.

The practical workflow is:

run TEVO locally
choose one bridge-compatible candidate
export an autoresearch@home handoff workspace
run that candidate in an autoresearch@home fork or branch
publish the result, insight, and next hypothesis through their coordinator

evo-loop autoresearch-at-home-handoff \
  --frontier runs/<run>/frontier.json \
  --candidate-id <candidate_id> \
  --run-root runs/at_home_handoff

That command writes a staged repo/ workspace with the TEVO-rendered train.py already installed. See docs/autoresearch_at_home_handoff.md for the full flow.

Why This Exists

This repo is a sandbox for answering: under a fixed training recipe and constraints, what architectural motifs survive?

explore quality / speed / memory trade-offs without hand-authoring dozens of variants
keep the search inspectable: every run emits a frontier, lineage, and checkpoints
use small or short-budget runs as a fast feedback loop before spending larger GPU budgets downstream
separate cheap motif discovery from honest downstream validation instead of treating them as the same benchmark

Installation

Prerequisites: Python 3.11+, a supported device (CUDA, MPS, or CPU), and uv.

git clone https://github.com/strangeloopcanon/tevo.git
cd tevo
make setup
source .venv/bin/activate

Optional: export HF_TOKEN="..." for Hugging Face dataset access, and export TOKENIZERS_PARALLELISM=false to suppress tokenizer fork warnings.

Quick Start

Run a smoke evolution loop (CPU, ~2 minutes):

export TOKENIZERS_PARALLELISM=false
RUN="runs/live_smoke_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$RUN"

python scripts/run_live.py configs/live_smoke.yaml \
  --device cpu --generations 3 --steps 40 --eval-batches 2 --seed 0 \
  --out "$RUN/frontier.json" \
  --lineage-out "$RUN/frontier_lineage.json" \
  --state-out "$RUN/frontier.state.json" \
  --checkpoint-dir "$RUN/checkpoints" \
  --prune-checkpoints-to-frontier \
  2>&1 | tee "$RUN/live.log"

python scripts/report_motifs.py "$RUN/frontier.json" \
  --lineage "$RUN/frontier_lineage.json" --top 10

Then export a winner and rerun it:

CANDIDATE_ID="<paste_id_from_report>"
python scripts/export_seed.py "$RUN/frontier.json" \
  --id "$CANDIDATE_ID" --out-config configs/seed_winner.yaml
python scripts/run_live.py configs/seed_winner.yaml \
  --device cpu --generations 1 --steps 40 --eval-batches 2 --seed 0 \
  --out runs/replay/frontier.json

See docs/configuration_guide.md for longer sweep commands (MPS, CUDA, Modal).

Optional Campaign Layer

So what: TEVO can now support a small GitHub-native collaboration loop without changing the normal solo workflow. In v1, collaboration is artifact-based, not distributed training: contributors run normal TEVO sweeps, then submit small bundles derived from the run artifacts the repo already emits.

The first sendable collaborator packet lives at campaigns/campaign-001-d20-nanochat/COLLABORATOR_BRIEF.md.
Runs happen locally or on Modal, not in Git branches. GitHub is for claiming lanes and submitting compact bundles by PR.
Use the campaign issue as the source of truth for lane ownership. If Discussions are enabled, use them for questions or recruiting, not for claims.
campaigns/.../manifest.yaml fixes the comparison context: base config fingerprint, metric, budget, and lane list.
python scripts/campaign_submit.py ... packages an existing run into a tracked bundle under artifacts/campaigns/....
If you are packaging an older run without frontier.manifest.json, pass --config explicitly instead of guessing provenance.
python scripts/campaign_aggregate.py ... merges comparable bundles, and python scripts/campaign_shortlist.py ... picks the next bridgeable candidates for downstream validation.
If you ignore campaigns, nothing about the existing evo-loop or scripts/run_live.py flow changes.
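To make the idea of a "base config fingerprint" concrete, here is one way such a check could work. This is a sketch under assumptions, not the repo's implementation: the hashing scheme, the 16-character truncation, and the key names `base_config_fingerprint`, `metric`, and `budget` are all hypothetical.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a config via canonical JSON so key order does not change the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def comparable(bundle: dict, manifest: dict) -> bool:
    """A bundle is comparable only if it matches the campaign's fixed context."""
    keys = ("base_config_fingerprint", "metric", "budget")  # hypothetical key names
    return all(bundle.get(k) == manifest[k] for k in keys)


# Made-up config and manifest, only to exercise the functions.
base = {"steps": 40, "model": {"n_layer": 4}}
manifest = {
    "base_config_fingerprint": config_fingerprint(base),
    "metric": "val_bpb",
    "budget": "40-steps",
}
bundle = dict(manifest, candidate_id="example")
print(comparable(bundle, manifest))  # True
```

Fixing the fingerprint in the manifest is what lets bundles from different contributors be merged without silently mixing runs that used different base configs.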

What You Get From A Run

Each run writes artifacts under runs/<run_id>/:

File	Purpose
`frontier.json`	Non-dominated candidates (spec + metrics + id)
`frontier_lineage.json`	Genealogy graph (parents, mutations, crossover reports)
`frontier.manifest.json`	Run recipe + environment snapshot
`checkpoints/`	Model checkpoints (often pruned to frontier-only)
`live.log`	Full run log
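If you want to inspect a frontier without the reporting script, something like the following may be enough. The file layout is an assumption: this sketch treats `frontier.json` as a JSON list of entries with `id` and `metrics` keys, and the metric name `ppl` is made up. Check an actual run's output before relying on it.

```python
import json
import tempfile
from pathlib import Path


def rank_frontier(path: Path, metric: str) -> list[tuple[str, float]]:
    """Return (id, metric value) pairs sorted ascending (lower is better)."""
    entries = json.loads(path.read_text())
    scored = [
        (e["id"], e["metrics"][metric])
        for e in entries
        if metric in e.get("metrics", {})
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Stand-in file with made-up values, only to exercise the function.
sample = [
    {"id": "cand-1", "metrics": {"ppl": 31.0}},
    {"id": "cand-2", "metrics": {"ppl": 28.5}},
]
with tempfile.TemporaryDirectory() as tmp:
    frontier_path = Path(tmp) / "frontier.json"
    frontier_path.write_text(json.dumps(sample))
    ranking = rank_frontier(frontier_path, "ppl")
print(ranking)  # [('cand-2', 28.5), ('cand-1', 31.0)]
```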

System Architecture

flowchart LR
    subgraph evo [TEVO Search]
        SEED[Typed Seed Config] --> MUT[Mutate/Crossover]
        MUT --> TRAIN[Train Proxy Budget]
        TRAIN --> EVAL[Evaluate Metrics]
        EVAL --> SELECT[Pareto Frontier]
        SELECT --> |Next Gen| MUT
    end
    SELECT --> EXPORT[Export Winner]

    subgraph bridge [Transfer Bridge]
        EXPORT --> RECIPE[TrainRecipe]
        RECIPE --> PROJECT[Project Motif Into Safe Scaffold]
        PROJECT --> PATCH[Patch Upstream train.py]
    end

    subgraph judge [Downstream Validation]
        PATCH --> BENCH[autoresearch Benchmark]
        BENCH --> RESULT[val_bpb / VRAM Result]
    end

Component	What it does
DSL (`dsl.py`)	Typed architecture genome (blocks, layers, mixers, training hparams)
Orchestrator (`orchestrator.py`)	Population management, selection (`map_elites`, `lexicase`, etc.)
Mutations (`mutations.py`)	Genetic operators that modify specs (grow, shrink, toggle)
Crossover (`crossover.py`)	Splice two architectures + merge checkpoints
Evaluation (`evaluation.py`)	Rung-based filtering: static analysis, then short training, then full training
TrainRecipe Bridge (`train_recipe.py`)	Export renderer-safe motif knobs and patch downstream `train.py` targets
Transfer Workflows (`cuda_transfer.py`, `mlx_transfer.py`)	Stage honest baseline-vs-seeded downstream validation runs
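The staged filtering listed for `evaluation.py` follows a familiar pattern: run the cheapest check first and only spend training budget on candidates that pass. The sketch below shows that pattern only; the rung functions, thresholds, and the fake loss are placeholders, not the repo's evaluation logic.

```python
from typing import Callable, Dict, List, Optional

Spec = Dict[str, int]
Rung = Callable[[Spec], Optional[float]]  # None means "rejected at this rung"


def static_check(spec: Spec) -> Optional[float]:
    # Cheapest rung: reject shapes that cannot work, without training anything.
    return 0.0 if spec["d_model"] % spec["n_head"] == 0 else None


def short_train(spec: Spec) -> Optional[float]:
    # Placeholder for a short proxy run: a fake loss that favours depth.
    loss = 4.0 - 0.1 * spec["n_layer"]
    return loss if loss < 3.5 else None


def run_rungs(spec: Spec, rungs: List[Rung]) -> Optional[float]:
    """Run rungs in order, stopping at the first rejection."""
    score: Optional[float] = None
    for rung in rungs:
        score = rung(spec)
        if score is None:
            return None
    return score


rungs = [static_check, short_train]
print(run_rungs({"d_model": 64, "n_head": 4, "n_layer": 8}, rungs))  # passes both rungs
print(run_rungs({"d_model": 65, "n_head": 4, "n_layer": 8}, rungs))  # None: fails static check
```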

Project Structure

configs/                          YAML experiment configurations
docs/                             Deep-dive documentation
scripts/
    run_live.py                   Main entry point for evolution
    export_seed.py                Export a winner as a new seed
    report_motifs.py              Motif analysis
    run_benchmark.py              NanoGPT-style benchmark
    archive_run.py                Archive runs and reclaim disk
src/transformer_evolution_llm/
    dsl.py                        Architecture specification (Pydantic)
    models.py                     PyTorch block implementations
    orchestrator.py               Evolution engine
    trainer.py                    Training loop with weight inheritance
    mutations.py                  Genetic operators
    crossover.py                  Crossover + checkpoint merge
tests/                            Pytest suite

Scope

Bounded search space: evolution can only produce what the DSL + mutation set can express. To explore a new primitive, add it to the codebase first.
Short-budget proxies: metrics come from short surrogate training. The frontier moves when you increase data or steps, or change objectives.
Not a leaderboard: the NanoGPT-style benchmark is for repeatable within-repo comparisons.
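The "bounded search space" point is easier to see in code. In this sketch a validated spec plus a fixed list of operators defines everything evolution can reach. It uses standard-library dataclasses as a stand-in for the Pydantic models in `dsl.py`; `SpecSketch`, its fields, the bounds, and the three operators are hypothetical and only echo the grow, shrink, and toggle operators named in the component table.

```python
import random
from dataclasses import dataclass, replace

MAX_LAYERS = 48  # hypothetical bound


@dataclass(frozen=True)
class SpecSketch:
    n_layer: int
    use_gated_mlp: bool

    def __post_init__(self) -> None:
        # Validation plays the role the typed schema plays in the real DSL.
        if not 1 <= self.n_layer <= MAX_LAYERS:
            raise ValueError("n_layer out of range")


def grow(s: SpecSketch) -> SpecSketch:
    return replace(s, n_layer=min(s.n_layer + 1, MAX_LAYERS))


def shrink(s: SpecSketch) -> SpecSketch:
    return replace(s, n_layer=max(s.n_layer - 1, 1))


def toggle(s: SpecSketch) -> SpecSketch:
    return replace(s, use_gated_mlp=not s.use_gated_mlp)


MUTATIONS = [grow, shrink, toggle]


def mutate(spec: SpecSketch, rng: random.Random) -> SpecSketch:
    """Apply one randomly chosen operator; the result is always a valid spec."""
    return rng.choice(MUTATIONS)(spec)


rng = random.Random(0)
spec = SpecSketch(n_layer=4, use_gated_mlp=False)
for _ in range(200):
    spec = mutate(spec, rng)
print(spec)
```

No sequence of these operators can produce, say, a new attention variant; reaching one requires adding a field to the spec and an operator that touches it.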

Full Documentation Index

Topic	Document
Configuration, objectives, selection profiles	docs/configuration_guide.md
Architecture comparison (GPT-2 vs nanochat vs TEVO), glossary	docs/architecture_comparison.md
Current TEVO -> autoresearch motif-transfer proof	docs/motif_transfer_demo.md
Example frontier survivors	docs/example_frontiers.md
Findings & operational lessons	docs/evolution_takeaways.md
Optimizer & method discovery	docs/optimizer_method_discovery.md
Run history & evolution log	docs/run_history.md
Running on Modal GPUs	docs/modal_run.md
TrainRecipe bridge to autoresearch / MLX (including CUDA-safe motif projection)	docs/train_recipe_bridge.md
First public CUDA transfer workflow	docs/cuda_transfer_demo.md
First public MLX transfer workflow	docs/mlx_transfer_demo.md
GPU scale-up plan (350M-1B)	docs/gpu_run_plan.md
NanoGPT-style benchmark contract	docs/nanogpt_benchmark.md
Scaling policies	docs/scale_policy.md
nanochat alignment notes	docs/nanochat_alignment.md
CLI reference & workflows	docs/cli_reference.md
Troubleshooting	docs/troubleshooting.md

Roadmap

350M-1B sanity runs on GPU -- validate that motifs discovered at small scale transfer. See docs/gpu_run_plan.md.
Speedrun-style eval -- scale the packed-token cache and calibrate targets for useful dynamic range. See docs/nanogpt_benchmark.md.
Multi-GPU evolution with ZeRO-1 or FSDP for larger populations.

Contributing

We welcome contributions! See AGENTS.md for operational guidelines.

make setup       # Create venv + install deps
make check       # Format + lint + types + security
make test        # Run test suite
