Millwright

Adaptive tool selection for AI agents. Millwright learns from feedback to recommend better tools over time, combining semantic search with historical fitness scores.

Based on the design described in Millwright: Smarter Tool Selection with Adaptive Toolsheds.

How it works

An agent calls suggest_tools(query) and gets back a ranked list of candidate tools. After using a tool, the agent calls review_tools(session, reviews) with a fitness rating (perfect, related, unrelated, or broken). Over time, this feedback loop improves which tools get surfaced for similar queries.

The pipeline:

Decompose the query into atomic subqueries
Embed subqueries and tool descriptions with all-MiniLM-L6-v2
Semantic rank — cosine similarity between subquery and tool embeddings (max across subqueries)
Historical rank — look up similar past queries in the review index, weight by fitness and similarity
Fuse — interleave both rankings with holdout guarantees (at least N slots from each signal), then rerank by combined score
Explore — epsilon-greedy random injection (10%) to discover underused tools
Return top-k candidates plus a __none__ sentinel
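
The semantic rank step can be sketched in a few lines. This is a minimal illustration, not the code in ranking.py: the function name semantic_scores and the toy 3-dimensional vectors are assumptions, standing in for real all-MiniLM-L6-v2 embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_scores(subquery_vecs, tool_vecs):
    """Score each tool by its best cosine similarity to any subquery."""
    return {
        name: max(cosine(q, vec) for q in subquery_vecs)
        for name, vec in tool_vecs.items()
    }

# Toy vectors (hypothetical); real embeddings come from the Embedder.
subqueries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tools = {
    "read_file": [0.9, 0.1, 0.0],
    "send_email": [0.0, 0.8, 0.6],
    "resize_image": [0.0, 0.0, 1.0],
}
scores = semantic_scores(subqueries, tools)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Taking the max across subqueries means a tool only has to match one part of a compound query to rank well.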

Feedback flows back through an append-only review log. Periodically, K-means compaction clusters the log into a compact review index for fast lookup.
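
As a rough sketch of what compaction does, the example below clusters logged query embeddings with plain K-means and averages fitness per tool inside each cluster. The entry fields (embedding, tool, fitness), the function names, and the unweighted mean are all assumptions for illustration; compaction.py uses centroid-weighted aggregation and may differ in other details.

```python
from collections import defaultdict

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            for p in points]

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = mean(members)
    return centroids, assign(points, centroids)

def compact(review_log, k):
    """Collapse the log into k clusters with mean fitness per tool."""
    centroids, labels = kmeans([e["embedding"] for e in review_log], k)
    index = []
    for i, centroid in enumerate(centroids):
        ratings = defaultdict(list)
        for entry, lab in zip(review_log, labels):
            if lab == i:
                ratings[entry["tool"]].append(entry["fitness"])
        index.append({
            "centroid": centroid,
            "fitness": {t: sum(v) / len(v) for t, v in ratings.items()},
        })
    return index

# Hypothetical log entries with 2-dimensional embeddings.
log = [
    {"embedding": [1.0, 0.0], "tool": "read_file", "fitness": 1.0},
    {"embedding": [0.9, 0.1], "tool": "read_file", "fitness": 0.5},
    {"embedding": [0.0, 1.0], "tool": "send_email", "fitness": 1.0},
    {"embedding": [0.1, 0.9], "tool": "send_email", "fitness": 0.0},
]
index = compact(log, k=2)
```

Lookup then only needs to compare a new query against k centroids instead of every logged review.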

Project structure

millwright/              # Core library
  toolshed.py            # Main orchestrator — suggest_tools() and review_tools()
  ranking.py             # Semantic ranking, historical ranking, holdout fusion
  compaction.py          # K-means clustering of review log into review index
  config.py              # All hyperparameters in one dataclass
  models.py              # Data structures (ToolDefinition, ReviewEntry, etc.)
  embedder.py            # SentenceTransformer wrapper with caching
  decomposer.py          # Query decomposition (mock + Claude API)
  storage.py             # JSONL review log + JSON review index

benchmark/               # Evaluation framework
  run_benchmark.py       # Entry point — learning curve, sweeps, baselines
  simulation.py          # Multi-round simulation with simulated agent feedback
  baselines.py           # Random and TF-IDF baseline rankers
  tools.py               # 200 synthetic tools across 12 domains
  queries.py             # 120 benchmark queries in 3 difficulty tiers
  metrics.py             # MRR, Precision@k, Hit@k
  report.py              # D3.js HTML report generation

Getting started

With Nix (recommended)

nix develop          # Sets up Python 3.12 venv + installs deps

Without Nix

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Run the benchmark

# Full benchmark (100 rounds, all sweeps, baselines, holdout, multi-turn)
python -m benchmark.run_benchmark --rounds 100 --seeds 3 --holdout 0.2 --continuation-prob 0.3

# Quick check (10 rounds, no sweeps or baselines)
python -m benchmark.run_benchmark --rounds 10 --no-sweep --no-fitness-sweep --no-baselines --no-compaction-sweep

# Multi-seed averaging with noisy feedback (correlated noise model)
python -m benchmark.run_benchmark --rounds 100 --seeds 5 --noise 0.1 --noise-model correlated

# Export structured results for external analysis
python -m benchmark.run_benchmark --results-json results.json

# Load narrative descriptions from a JSON file
python -m benchmark.run_benchmark --descriptions descriptions.json

# Custom output path
python -m benchmark.run_benchmark -o my_report.html

The benchmark writes benchmark_report.html by default (change the path with -o); the report contains interactive D3 charts. Without --descriptions, it renders charts and tables only (no interpretive text). Pass a JSON file mapping section keys to prose to populate narrative blocks.

CLI reference

Flag	Default	Description
`--rounds N`	100	Learning curve rounds
`--seeds N`	1	Random seeds for averaging (≥3 enables CIs and significance tests)
`--seed N`	42	Base random seed
`--holdout F`	0.0	Fraction of queries held out for test-only evaluation (stratified by tier)
`--continuation-prob F`	0.0	Probability of multi-turn continuation when P@1 misses
`--noise F`	0.0	Feedback noise probability (0.0–1.0)
`--noise-model`	uniform	`uniform` or `correlated` (category confusion pairs)
`--sweep-rounds N`	10	Rounds per configuration in sweeps
`--no-sweep`		Skip slot holdout sweep
`--no-fitness-sweep`		Skip fitness multiplier sweep
`--no-baselines`		Skip baseline comparison (random, TF-IDF, semantic)
`--no-compaction-sweep`		Skip compaction frequency sweep
`--results-json PATH`		Save structured results to JSON
`--descriptions PATH`		Load narrative description blocks from JSON
`-o PATH`	benchmark_report.html	Output HTML report path

Working with results

--results-json writes a JSON file with all benchmark data for external analysis:

python -m benchmark.run_benchmark --rounds 50 --seeds 3 --holdout 0.2 --results-json results.json

The JSON structure:

{
  "simulation": {
    "adaptive": [                    # per-round metrics
      {
        "round": 1,
        "overall": {"mrr": ..., "p@1": ..., "p@3": ..., "p@5": ..., "hit@5": ...},
        "tier_1": {...}, "tier_2": {...}, "tier_3": {...},
        "overall_std": {...},        # stddev across seeds (when seeds > 1)
        "overall_ci": {"mrr": [lo, hi], ...},  # bootstrap 95% CI (when seeds >= 3)
        "train_overall": {...},      # training set only (when --holdout > 0)
        "test_overall": {...},       # holdout set only
        "test_tier_1": {...}, ...
      },
      ...
    ],
    "baseline": [...],               # same shape, semantic-only (no feedback)
    "n_seeds": 3,
    "feedback_noise": 0.0,
    "significance": {                # when seeds >= 3
      "wilcoxon": {"mrr": {"statistic": ..., "p_value": ...}, ...},
      "adaptive_final_ci": {"mrr": [lo, hi], ...}
    }
  },
  "baselines": [                     # when baselines enabled
    {"label": "Random",   "metrics": {"overall": {...}, "tier_1": {...}, ...}},
    {"label": "TF-IDF",   "metrics": {...}},
    {"label": "Semantic", "metrics": {...}}
  ],
  "slot_sweep": [                    # when sweep enabled
    {"label": "S5/H0", "min_semantic_slots": 5, "min_historical_slots": 0,
     "rounds": [...]},
    ...
  ],
  "fitness_sweep": [                 # when fitness sweep enabled
    {"label": "Mild", "preset": {"perfect": 1.2, ...}, "rounds": [...]},
    ...
  ],
  "compaction_sweep": [              # when compaction sweep enabled
    {"label": "Every round", "compact_every": 1, "rounds": [...]},
    ...
  ],
  "elapsed": 18405.3
}

Example: extract the learning curve into a CSV for plotting elsewhere:

import json, csv

with open("results.json") as f:
    data = json.load(f)

with open("learning_curve.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["round", "mrr", "p@1", "hit@5", "t3_mrr"])
    for r in data["simulation"]["adaptive"]:
        w.writerow([r["round"], r["overall"]["mrr"], r["overall"]["p@1"],
                    r["overall"]["hit@5"], r["tier_3"]["mrr"]])
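
A second example, under the same assumptions about the JSON layout shown above: compare the adaptive run against the semantic-only baseline round by round. The helper name metric_gain_by_round is made up for this sketch, and the inline sample dict only mimics the documented shape.

```python
def metric_gain_by_round(data, metric="mrr"):
    """Return (round, adaptive, baseline, gain) tuples for one overall metric."""
    baseline = {r["round"]: r for r in data["simulation"]["baseline"]}
    rows = []
    for r in data["simulation"]["adaptive"]:
        base = baseline.get(r["round"])
        if base is None:
            continue  # no matching baseline round recorded
        a = r["overall"][metric]
        b = base["overall"][metric]
        rows.append((r["round"], a, b, a - b))
    return rows

# Tiny stand-in for json.load(open("results.json")); values are invented.
sample = {
    "simulation": {
        "adaptive": [
            {"round": 1, "overall": {"mrr": 0.5}},
            {"round": 2, "overall": {"mrr": 0.75}},
        ],
        "baseline": [
            {"round": 1, "overall": {"mrr": 0.5}},
            {"round": 2, "overall": {"mrr": 0.5}},
        ],
    }
}
rows = metric_gain_by_round(sample)
for rnd, adaptive, base, gain in rows:
    print(f"round {rnd}: adaptive={adaptive:.3f} baseline={base:.3f} gain={gain:+.3f}")
```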

Report descriptions

The --descriptions flag accepts a JSON file mapping section keys to HTML strings. Any key not present renders as empty (charts and tables still appear). Section keys:

intro, methodology, learning_curves, mrr_caption, p1_caption, hit_caption, milestones, improvement, improvement_tier3, improvement_takeaway, slot_sweep, slot_sweep_interpretation, slot_sweep_mrr_caption, slot_sweep_p1_caption, fitness_sweep, fitness_sweep_interpretation, fitness_sweep_mrr_caption, fitness_sweep_p1_caption, holdout_eval, baselines, multi_turn, compaction_sweep, compaction_sweep_interpretation

Example:

{
  "intro": "Millwright is an adaptive tool selection system for AI agents.",
  "improvement_takeaway": "<strong>Key finding:</strong> Tier 3 gains are largest."
}
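
Because unknown keys are easy to mistype, it can help to check a descriptions dict against the section key list before writing it out. The sketch below is a hypothetical helper, not part of the repository; the key set is copied from the list above.

```python
import json

SECTION_KEYS = {
    "intro", "methodology", "learning_curves", "mrr_caption", "p1_caption",
    "hit_caption", "milestones", "improvement", "improvement_tier3",
    "improvement_takeaway", "slot_sweep", "slot_sweep_interpretation",
    "slot_sweep_mrr_caption", "slot_sweep_p1_caption", "fitness_sweep",
    "fitness_sweep_interpretation", "fitness_sweep_mrr_caption",
    "fitness_sweep_p1_caption", "holdout_eval", "baselines", "multi_turn",
    "compaction_sweep", "compaction_sweep_interpretation",
}

def unknown_keys(blocks):
    """Return the keys in `blocks` that are not documented section keys."""
    return sorted(set(blocks) - SECTION_KEYS)

def write_descriptions(blocks, path):
    """Validate the keys, then write the mapping as JSON."""
    bad = unknown_keys(blocks)
    if bad:
        raise KeyError(f"unknown section keys: {bad}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(blocks, f, indent=2)

blocks = {
    "intro": "Millwright is an adaptive tool selection system for AI agents.",
    "improvement_takeaway": "<strong>Key finding:</strong> Tier 3 gains are largest.",
}
print(unknown_keys(blocks))  # an empty list means every key is recognized
```

Calling write_descriptions(blocks, "descriptions.json") would then produce a file suitable for --descriptions.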

Benchmark results

200 tools across 12 domains, 120 queries in 3 tiers:

Tier	Count	Description	Example
1 — Direct	45	Query closely matches tool description	"read a file"
2 — Indirect	40	Rephrased/colloquial query	"check what's inside this document"
3 — Ambiguous	35	Vague, multiple correct tools	"get the data"

Results from a 100-round, 3-seed run with 20% holdout and multi-turn testing:

Baselines (single-pass, no feedback), with the adaptive system at round 100 for comparison:

Method	MRR	P@1	Hit@5	T3 MRR
Random	0.018	0.006	0.050	0.023
TF-IDF	0.617	0.550	0.708	0.549
Semantic-only	0.796	0.700	0.928	0.743
Adaptive (R100)	0.879	0.825	0.956	0.890

Learning curve (adaptive, 3-seed average):

Round	MRR	P@1	Hit@5	T3 MRR
1	0.797	0.700	0.933	0.743
5	0.863	0.811	0.942	0.845
10	0.863	0.806	0.950	0.820
50	0.870	0.817	0.950	0.871
100	0.879	0.825	0.956	0.890
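
For readers unfamiliar with the column names in the tables above, here is a minimal sketch of the textbook definitions of MRR, Precision@k, and Hit@k. It is not copied from benchmark/metrics.py, which may treat queries with multiple correct tools differently; the function names are assumptions.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant tool, or 0.0 if none appears."""
    for position, tool in enumerate(ranked, start=1):
        if tool in relevant:
            return 1.0 / position
    return 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k suggestions that are relevant."""
    return sum(tool in relevant for tool in ranked[:k]) / k

def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant tool appears in the top k, else 0.0."""
    return 1.0 if any(tool in relevant for tool in ranked[:k]) else 0.0

def mean_reciprocal_rank(results):
    """Average reciprocal rank over (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)

# Two toy queries: the first finds its tool at rank 2, the second at rank 1.
results = [
    (["list_dir", "read_file", "send_email"], {"read_file"}),
    (["send_email", "read_file", "list_dir"], {"send_email"}),
]
mrr = mean_reciprocal_rank(results)
print(mrr)
```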

Sweep highlights:

Slot holdout: S4/H1 (mostly semantic + 1 historical slot) best at MRR=0.936
Fitness multipliers: "Mild" and "Wide" tied at MRR=0.908; "Flat" (no learning signal) drops to 0.660
Compaction frequency: compacting every 1–2 rounds is optimal; compacting every 10 or more rounds degrades to the no-learning baseline

Train/test holdout (20% held out, no feedback): Train MRR reaches 0.943 while test MRR settles at 0.626, a large gap. This is expected: the toolshed is designed to memorize which tools work for queries the agent actually asks, so the improvement is concentrated on seen queries, which is the intended use case. Test MRR stays well above Random (0.018) but only marginally above TF-IDF (0.617), so generalization through embedding similarity in the review index appears modest.

Open benchmark_report.html for full interactive charts including holdout evaluation, baseline comparison, multi-turn results, and all sweep visualizations.

Algorithm deep dives

Where to look if you want to understand or modify specific parts:

Topic	File	What to read
Core loop	`millwright/toolshed.py`	`suggest_tools()` and `review_tools()` are the two API entry points
Ranking signals	`millwright/ranking.py`	`semantic_rank()`, `historical_rank()`, and `fuse_rankings()`
Fusion strategy	`millwright/ranking.py:93`	Holdout + interleave + rerank — the key design choice
Review compaction	`millwright/compaction.py`	K-means clustering with centroid-weighted fitness aggregation
Hyperparameters	`millwright/config.py`	Every tunable knob in one place
NONE sentinel	`millwright/toolshed.py:117`	Implicit negative feedback when agent rejects all suggestions
Feedback simulation	`benchmark/simulation.py:57`	How the benchmark generates perfect/related/unrelated ratings
Query difficulty tiers	`benchmark/queries.py`	The 120 test queries with ground truth

Hacking on things

Tune hyperparameters

Everything is in millwright/config.py. The most impactful knobs:

min_semantic_slots / min_historical_slots — how many top-k slots are guaranteed from each ranking signal. The slot sweep in the benchmark explores this space.
fitness_perfect / fitness_related / fitness_unrelated / fitness_broken — how aggressively feedback shifts future rankings. The fitness sweep explores this.
historical_similarity_threshold (default 0.3) — minimum cosine similarity to consider a historical review relevant. Lower = more recall, more noise.
epsilon (default 0.1) — probability of replacing the last suggestion with a random unexplored tool.
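
To make the interaction between the fitness values and historical_similarity_threshold concrete, here is a small sketch of one plausible reading of "weight by fitness and similarity": sum similarity times fitness over past reviews that clear the threshold. The FITNESS values, the review fields, and the function name historical_scores are illustrative assumptions, not the defaults in config.py or the logic in ranking.py.

```python
import math

# Illustrative weights only; the real values live in millwright/config.py.
FITNESS = {"perfect": 1.2, "related": 0.6, "unrelated": 0.1, "broken": 0.0}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def historical_scores(query_vec, reviews, threshold=0.3):
    """Sum similarity * fitness over reviews at or above the threshold."""
    scores = {}
    for review in reviews:
        sim = cosine(query_vec, review["embedding"])
        if sim < threshold:
            continue  # too dissimilar to count as evidence
        weight = sim * FITNESS[review["rating"]]
        scores[review["tool"]] = scores.get(review["tool"], 0.0) + weight
    return scores

# Hypothetical past reviews with 2-dimensional query embeddings.
reviews = [
    {"embedding": [1.0, 0.0], "tool": "read_file", "rating": "perfect"},
    {"embedding": [0.8, 0.6], "tool": "list_dir", "rating": "related"},
    {"embedding": [0.0, 1.0], "tool": "send_email", "rating": "perfect"},
]
scores = historical_scores([1.0, 0.0], reviews)
print(scores)
```

Lowering the threshold lets the send_email review contribute, which is the recall-versus-noise trade-off described above.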

Change the fusion strategy

The fusion logic lives in ranking.py:fuse_rankings(). The current approach is holdout + interleave + score-based rerank. If you want to try weighted sums, reciprocal rank fusion, or something else, this is the function to replace.
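
As a starting point for experiments, here is a minimal sketch of reciprocal rank fusion, one of the alternatives mentioned above. It is not the current fuse_rankings() behavior. It works on ranked lists instead of score dicts, so scores would need to be sorted into lists first; the function name and the conventional constant k=60 are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a tool's score."""
    scores = {}
    for ranking in rankings:
        for rank, tool in enumerate(ranking, start=1):
            scores[tool] = scores.get(tool, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from the two signals.
semantic = ["read_file", "list_dir", "send_email"]
historical = ["list_dir", "delete_file", "read_file"]
fused = reciprocal_rank_fusion([semantic, historical])
print(fused)
```

Unlike the holdout approach, this gives no guaranteed slots to either signal, so a tool that only one ranking knows about can be pushed out of the top k.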

Add tools or queries

Tools: benchmark/tools.py — add entries to the tool list with a name, description, and category.
Queries: benchmark/queries.py — add BenchmarkQuery(query, expected_tools, category, tier) entries.

Swap the embedding model

Change embedding_model in config.py. The Embedder class in embedder.py wraps sentence-transformers, so any model that library can load should work. Update embedding_dim to match the new model's output dimension.

Use real query decomposition

The benchmark uses MockDecomposer (splits on conjunctions). For production use, ClaudeDecomposer in decomposer.py calls the Claude API to semantically decompose compound queries. Pass it to the Toolshed constructor instead.
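
For intuition, conjunction splitting of the kind MockDecomposer performs might look roughly like the sketch below. The exact rules in decomposer.py are not shown here, so the regular expression and the function name split_on_conjunctions are assumptions.

```python
import re

# Split on commas, semicolons, "and then", "then", and "and" (whole words only).
_SPLIT = re.compile(r"\s*(?:,|;|\band then\b|\bthen\b|\band\b)\s*", re.IGNORECASE)

def split_on_conjunctions(query):
    """Naively break a compound query into subqueries."""
    parts = [p.strip() for p in _SPLIT.split(query)]
    # Drop empty fragments; fall back to the whole query if nothing is left.
    return [p for p in parts if p] or [query.strip()]

print(split_on_conjunctions("read the file and email the summary"))
print(split_on_conjunctions("download the report, then email it"))
```

A splitter like this also breaks phrases such as "search and replace" in two, which is one reason a semantic decomposer is preferable in production.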

Add new ranking signals

The fusion function takes two score dicts (semantic + historical). To add a third signal (e.g., recency, popularity, cost), produce a dict[str, float] and modify fuse_rankings() to accept and interleave it.

Benchmark limitations

Addressed

The following issues from earlier versions have been fixed:

Train/test split — --holdout 0.2 splits queries stratified by tier. Training queries get feedback; holdout queries are evaluate-only. Test metrics measure generalization.
Multiple baselines — Random, TF-IDF, and semantic-only baselines run alongside the adaptive system.
Statistical rigor — Bootstrap CIs (10k resamples) when --seeds >= 3. Paired Wilcoxon signed-rank test for adaptive vs. baseline significance.
Multi-turn sessions — --continuation-prob 0.3 exercises continue_session() when P@1 misses. Tracks multi-turn hit rate and rounds needed.
Compaction timing — Compaction frequency sweep tests every 1/2/5/10/20 rounds.
Correlated noise — --noise-model correlated defines confusion pairs (file↔system, http↔cloud, database↔transform, crypto↔auth, messaging↔monitoring) with 3× degradation probability, modeling systematic agent errors rather than uniform noise.
Report narrative — Interpretive text is no longer hardcoded. The report renders data only unless --descriptions provides a JSON file with narrative blocks.
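
For reference, a percentile bootstrap confidence interval of the kind mentioned under statistical rigor can be sketched as follows. What the benchmark actually resamples (seeds, queries, or rounds) is not specified here, so the input values, the function name bootstrap_ci, and the fixed seed are assumptions.

```python
import random

def bootstrap_ci(values, n_resamples=10_000, confidence=0.95, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Invented per-seed scores, purely for illustration.
per_seed = [0.80, 0.85, 0.90]
lo, hi = bootstrap_ci(per_seed)
print(f"95% CI for the mean: [{lo:.3f}, {hi:.3f}]")
```

With only three seeds the interval is coarse, since each resample can take only a handful of distinct means.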

Still open

Synthetic tools with clean categories — The 200 tools have crisp, non-overlapping categories. Real tool catalogs have fuzzy boundaries, overlapping functionality, and misleading descriptions.
MockDecomposer sidesteps a core feature — The benchmark uses naive conjunction splitting, not the Claude-based decomposer. Per-subquery storage is mostly 1:1 with mock decomposition, so its value is untested.
No distribution shift or catalog changes — Tools and queries are fixed for the entire run. No testing of stale reviews for removed tools, cold-start for newly added tools, or query distribution drift.
Metric interpretation is use-case dependent — If agents scan all 5 suggestions, Hit@5 matters most. If they only use #1, P@1 is all that matters. The report presents all metrics without asserting which matters most.

Fidelity notes

See NEXT_STEPS.md for a list of spec-alignment fixes applied against the original blog post. The core algorithm — decompose, embed, dual-rank, holdout-fuse, feedback loop, K-means compaction — matches the post. Minor gaps: the "create custom tool" option and shadow testing during compaction are not implemented.
