awesome-autoresearch

A curated list of public autoresearch use cases across industries.

This README aggregates the current category files, so the latest accepted entries are visible here without drilling into subpages.

The repository distinguishes between:

  • primary categories, which hold the strongest case evidence: repos, project pages, and concrete write-ups
  • secondary overlap categories, which view the same evidence from another angle to surface cross-cutting patterns
  • Related Practices / Discussions, which collects credible public practice signals (especially X threads, Reddit discussions, and interviews) that describe real autoresearch usage even when no strong standalone case page exists yet

Why this list

Most discussions about autoresearch are still scattered, vague, or overly tool-centric. This list is designed to answer two practical questions quickly:

  • Where has autoresearch already been used in real workflows?
  • Which patterns can transfer across industries?

This is not a comprehensive database. It is a high-signal, fast-scanning field guide.

Inclusion criteria

An entry should meet all of the following:

  • The source is public and citable.
  • The example is directly related to autoresearch, not just a generic research or monitoring agent.
  • The source explicitly mentions autoresearch, cites Karpathy's autoresearch, or clearly shows a modify → verify → keep/discard → repeat loop (a minimal code sketch of this loop appears below).
  • The summary explains the scenario, method, and value in one sentence.

We do not include:

  • Generic research agents, monitoring agents, or multi-agent systems with no explicit autoresearch loop.
  • Pure theory or opinion without a concrete practice.
  • Generic AI commentary with no autoresearch workflow.
  • Long write-ups inside the list itself.
  • Sources that are private, inaccessible, or too vague to classify.
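
To make the loop criterion concrete, here is a minimal, self-contained sketch of the modify → verify → keep/discard → repeat cycle. Everything in it is illustrative, not taken from any listed project: the toy objective and single mutable parameter stand in for a project's fixed training-and-evaluation harness, and the incumbent/candidate bookkeeping stands in for git commit and revert.

```python
import random

def run_experiment(params: dict) -> float:
    """Stand-in for one fixed-budget experiment scored by a single metric
    (e.g. val_bpb or AUC); here, a noisy toy objective peaking at x = 3."""
    return -(params["x"] - 3.0) ** 2 + random.gauss(0.0, 0.01)

def propose(params: dict) -> dict:
    """Mutate exactly one knob per iteration, as most listed loops do."""
    candidate = dict(params)
    candidate["x"] += random.uniform(-0.5, 0.5)
    return candidate

incumbent = {"x": 0.0}
best = run_experiment(incumbent)      # a baseline must exist before iterating
for _ in range(50):
    candidate = propose(incumbent)
    score = run_experiment(candidate)
    if score > best:                  # keep: the candidate becomes the incumbent
        incumbent, best = candidate, score
    # else: discard the change and try another mutation

print(f"best params: {incumbent}, best score: {best:.4f}")
```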

Current coverage

Primary categories

  • Scientific Research
  • Software / Systems Optimization
  • Evaluation / Red Teaming
  • Finance / Trading
  • Personal Knowledge / Humanities
  • Infra / Skills / Forks

Secondary overlap categories

  • Knowledge Base / RAG Preparation
  • Market Research
  • Workflow Automation

Open categories still being tracked

Some entries intentionally appear in more than one overlap category when the same project is both a domain case and a reusable workflow pattern.

Browse by category

Full list

Scientific Research

Source file: categories/scientific-research.md

  • AutoResearchClaw - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
  • Sibyl Research System - Scientific research: builds a fully autonomous AI scientist on Claude Code with inner research-iteration loops and outer self-evolution across projects.
  • autoresearch-rl - RL research: applies the autoresearch pattern to RL post-training by iterating on one training config, running fixed-time experiments, and keeping only eval improvements.
  • autoresearch-robotics - Robotics research: adapts Karpathy-style autoresearch to MuJoCo and Gymnasium robotics tasks by editing one training file, evaluating fixed-budget runs, and using simulator renderings plus vision feedback to keep only better policies.
  • Tinker-Explorer - Evidence-retrieval research: adapts the autoresearch pattern to GRPO document exploration, comparing reward designs and keeping only retrieval policies that answer multihop questions more accurately under a token budget.
  • Autoresearch on an old research idea - Multimodal retrieval research: applies Claude Code autoresearch to an old eCLIP idea, running 42 fixed-budget experiments with commit/revert decisions and cutting mean rank from 344.68 to 157.43.
  • autoresearch-at-home - Distributed ML research: coordinates a SETI@home-style swarm of agents that claim experiments, share full train.py results through Ensue, and collectively drive down val_bpb across different GPUs.
  • autoresearch-paper-benchmark - Graph ML research: runs paper-driven campaigns on a fixed Peptides-func benchmark by editing train.py, logging 300-second experiments, and testing only the best validation-AP model at campaign end.
  • autoresearch-cifar10 - Vision research: applies autoresearch to CIFAR-10 ResNet training on a 3090, iterating under fixed time budgets and keeping changes that lift accuracy beyond a 91.89% baseline.
  • AutoResearch-GenPose - Vision research: adapts autoresearch to CIFAR-10 UNet denoising by editing one training file, running fixed 5-minute experiments, and keeping only val_psnr improvements.
  • MLP-AutoResearch - MNIST training research: ports Karpathy's single-file loop to an MLP classifier, fixing 20-epoch runs and greedy keep/revert decisions that raised handwritten-digit accuracy from 0.9809 to 0.9836.
  • autoresearch-medimage - Medical imaging research: adapts Karpathy's prepare.py + train.py + results.tsv loop to 2D imaging tasks, using short-budget candidate discovery and staged follow-up validation to surface stronger ChestXray14 models.
  • autocircuit - Analog circuit optimization: adapts Karpathy's autoresearch to a SKY130 two-stage op-amp, editing optimize.py, running ngspice, and keeping only parameter changes that expand the GBW-versus-power Pareto front under phase-margin constraints.
  • fe-autoresearch - Tabular ML research: applies the autoresearch loop to LightGBM feature engineering on the UCI Bank Marketing dataset by editing one engineer_features() target, training against fixed AUC metrics, and keeping only improvements.
  • Paper Lantern improves Autoresearch - ML research augmentation: connects a 2M-paper MCP server to autoresearch, letting the agent cite 100 papers across 100 experiments and reach a 3.2% lower 2-hour validation loss than the same run without paper access.
  • Subtractive Search in a Mature Tabular Pipeline - Tabular ML research: applies Karpathy's autoresearch to a churn-prediction XGBoost pipeline, running 116 autonomous experiments and lifting subsample AUC from 0.902892 to 0.916721 largely by removing noisy target-encoded features.
  • autoresearch-connect4 - Game AI research: adapts Karpathy's three-file autoresearch loop to Connect Four by editing train.py, training 5-minute self-play runs, and keeping only changes that improve weighted win rate against fixed opponents.
  • autoresearch-tabular - Tabular ML research: adapts Karpathy's three-file loop to the Adult Income benchmark by editing only train.py, running fixed 2-minute experiments, and keeping only val_auc improvements.
  • ocr-autoresearch - OCR research: adapts autoresearch to ICDAR2015 scene-text recognition by editing one train.py, running fixed 5-minute CRNN+CTC experiments, and keeping only lower validation character error rates.
  • Tennis XGBoost Autoresearch - Sports analytics research: applies a Karpathy-style keep/revert loop to a 245K-match tennis XGBoost pipeline, then hardens the evaluator after the agent learned to game mutable ROC-AUC scoring.
  • Bio-Autoresearch - Drug discovery research: applies a Karpathy-style autoresearch loop to rare-disease drug repurposing on PrimeKG, running 15 GPU experiments with keep/revert decisions and lifting held-out per-disease AUPRC from 0.284 to 0.761.
  • autoresearch-quantum - Quantum research: runs incumbent/challenger autoresearch ratchets for encoded magic-state experiments, screens candidates on cheap noisy simulations, and promotes only justified challengers to expensive backends while logging transferable lessons.
  • kaggle-autoresearch - Tabular ML research: adapts Karpathy-style autoresearch to Kaggle competitions such as Titanic, House Prices, and Store Sales by iterating on feature and model code, logging approved baselines, and accepting only cross-validation improvements over fixed thresholds.
  • MiniMax M2.7: Early Echoes of Self-Evolution - AI-lab research: describes an internal research agent that automates 30%-50% of its RL workflow, plus a 100+ round keep/revert scaffold-optimization loop that improved internal evaluation scores by 30%.
  • autoresearch-macro - Macroeconomic forecasting research: runs LLM-guided outer-loop search over Chronos-2 covariates, transforms, and fine-tuning settings, keeping only validation-era forecast improvements across pseudo-real-time Norway, Canada, and Sweden benchmarks.
  • autoresearch-dqn - RL algorithm research: applies the autoresearch loop to a CartPole training script, logging 39 iterations that replaced an unstable DQN baseline with a REINFORCE agent that reaches reward 500 in about 5 seconds instead of about 3 minutes.
  • AutoMedal - Kaggle competition research: adapts Karpathy's keep/revert loop into strategist, researcher, and experimenter phases, journaling 24 tabular-competition experiments and keeping only lower val_loss changes on a fixed leaderboard-oriented harness.
  • autoresearch-qwen - Document VQA research: adapts Karpathy's keep/discard loop to Qwen3-VL on the official DocVQA benchmark by fixing evaluate.py, limiting edits to train.py, and accepting only higher full-validation ANLS scores.

Software / Systems Optimization

Source file: categories/software-systems-optimization.md

  • karpathy/autoresearch - ML training optimization: the original autoresearch loop where an agent edits a GPT training script, runs fixed-time experiments, and keeps only improvements in validation bits-per-byte.
  • AutoKernel - GPU optimization: applies Karpathy-style autoresearch to kernel bottlenecks, iterating on code, benchmarking, and keeping only changes that improve speed without breaking correctness.
  • autoresearch-webgpu - Browser ML optimization: ports Karpathy's autoresearch into the browser so agents can generate training code, run GPU-backed experiments, and feed losses back into the next iteration.
  • autoresearch-local-llm - Local ML optimization: replaces Claude Code with a local Qwen model to run the standard autoresearch keep/revert loop on a shared single GPU.
  • Shopify Liquid performance work via autoresearch - Software optimization: Tobi Lütke applied an autoresearch loop to Shopify's Liquid template engine, producing 93 automated commits that improved parse+render performance by 53% with 61% fewer allocations.
  • Autoresearch for SAT Solvers - SAT solver optimization: runs parallel MaxSAT experiments, updates reusable solver code plus expert memory, and improves public benchmark configurations against 2024 competition baselines.
  • autoresearch — Heuristic CP Edition - Heuristic solver optimization: adapts autoresearch to C++ competitive-programming solvers by editing only solver.cpp, scoring fixed benchmark instances, and keeping only lower average solution costs.
  • Autoresearch for game development - HTML5 game development: runs agents that improve games based on player feedback and usage metrics, benchmarking them with game Elo scores from head-to-head matchups.
  • SiliconSwarm@Ensue - Apple Silicon inference optimization: uses a multi-agent autoresearch loop to test ANE graph changes across chips and reports up to 6.31× lower median DistilBERT latency than CoreML.
  • Flash-MoE - Apple Silicon inference optimization: uses a Claude Code autoresearch loop to run 43 Metal optimization experiments on Qwen3.5-397B and reach 20.34 tok/s on an M5 Max by overlapping SSD reads with GPU compute.
  • Research-Driven Agents: When an agent reads before it codes - LLM inference optimization: extends Karpathy's autoresearch with a literature-review phase that reads papers and competing forks before parallel llama.cpp experiments, landing five kernel fusions and about 15% faster x86 flash-attention generation in about 3 hours.
  • Rails controller tuning with Claude Code /loop autoresearch - Backend performance optimization: adapts Karpathy's keep/discard loop to Rails controller latency by locking benchmark scripts and test data, running 10-minute cycles, and auto-reverting regressions.
  • Pytest speedups via autoresearch feedback loops - Test performance optimization: applies autoresearch to a backend pytest suite with a fixed evaluation harness, seven autonomous experiments, and a 295s → 71s keep/discard improvement path.
  • autoresearch-sudoku - Solver optimization: uses an enhanced autoresearch loop to rewrite a Rust sudoku solver over 312 experiments and beat Tdoku plus rust_sudoku on 4 of 6 standard benchmark datasets.
  • autospec - Backend service generation: applies an autoresearch-inspired keep-or-revert loop to natural-language business rules, iteratively building a Spring Boot service until Gradle and JUnit evaluation pass without regression.
  • How I used autoresearch to fix Gumroad's flaky tests in a week - Test reliability: uses OpenClaw autoresearch to run 206 commits and 94 CI cycles that fixed 13 flaky tests while surfacing a real file-ID remapping bug.
  • WinMoE - Windows inference optimization: uses an AI-driven autoresearch methodology with one-change measurements and keep-or-reject ledgers to lift Qwen3.5-397B throughput from 0.44 to 1.9 tok/s on consumer hardware.
  • ZK Autoresearch — Plonky3 DFT Optimizer - ZK prover optimization: applies Karpathy's autoresearch pattern to Plonky3's DFT code, running Rust tests plus Criterion benchmarks and keeping only commits that reduce coset_lde_batch time on BabyBear field workloads.
  • autoresearch-go-ane - Apple Silicon training optimization: ports Karpathy's loop to a Go plus ANE LLM trainer, benchmarking fixed 5-minute TinyStories runs with benchstat and keeping only lower val_loss configurations.
  • openroad-autoresearch-ibex - Chip design optimization: applies a fixed-harness autoresearch loop to OpenROAD RTL-to-GDSII experiments on the IBEX CPU, using scout-promote screening and objective-aware history to keep only timing, area, or power improvements.
  • OpenCLI - Browser automation reliability: adds a Karpathy-style autoresearch harness to OpenCLI, cycling review → modify → commit → verify → decide against fixed V2EX, Zhihu, browser, and save-as-CLI eval suites to keep only reliability improvements.
  • autoresearch-cublas-sam3 - GPU kernel optimization: applies an autoresearch loop to SAM3 GEMM tuning by mutating one config at a time, benchmarking on real GPUs, and keeping only changes that improved throughput by 2.14% over 120 experiments on an RTX 3090.
  • autoresearch-mamba - Mamba training optimization: adapts Karpathy's fixed-evaluator, 5-minute keep/discard loop to MLX Mamba-2, Mamba-3, and hybrid Mamba-Transformer models on Apple Silicon by mutating one training surface to lower val_bpb.
  • liltrAIner - Local LLM fine-tuning optimization: applies a Karpathy-style autoresearch loop to MLX LoRA runs on Apple Silicon, letting an agent mutate training data or config, score eval prompts, and keep or revert each fine-tuning experiment.
  • english-app - Education app optimization: applies an autoresearch-inspired proposer → implement → test → evaluate → keep/discard loop to an English learning app, using pytest, TypeScript checks, and smoke tests to keep only changes scoring at least 6.0 across 10 autonomous iterations.
  • How we built the best browser agent with Auto-Research - Browser automation optimization: uses parallel Claude Code auto-research loops against Online-Mind2Web, running 20-cycle harness edits with train/validation splits and reaching 97% on the benchmark while rejecting task-specific overfits.
  • Speed up code with pi-autoresearch - Software performance optimization: applies pi-autoresearch to jsonista's JSON decoding benchmark, keeping only measured wins and lifting one selected benchmark's throughput by 56% while surfacing overfitting risks in accepted diffs.
  • 588x Faster SQLite Ingestion With an Autoresearch Loop - ETL performance optimization: applies pi-autoresearch to a Python financial-data ingestion pipeline, benchmarking 50,000-row SQLite writes and keeping fixes that cut processing time from about 397s to 0.675s.
  • nnmetal + labrat - Apple Silicon inference optimization: uses an autonomous Zig and Metal autoresearch loop that snapshots engine files, makes one kernel change at a time, runs compile, test, and benchmark gates, and commits only throughput or latency wins above a fixed threshold.
  • HashSmith, Part 3: I Automated My Way to a 27% Faster Hash Table - Data-structure performance optimization: uses a Claude Code auto-optimize loop to profile, benchmark, and keep only wins on a JVM SwissTable implementation, landing three accepted changes and 13%-32% gains across eight benchmark scenarios.
  • claude-code-bench - AI coding workflow optimization: applies Karpathy-style autoresearch to Claude Code's 7-dimensional configuration space, running benchmark tasks and keeping only profiles that improve quality-adjusted scores for research depth, correctness, and convention adherence.
  • autooptimization - Systems optimization: applies a profile-first autoresearch protocol to codebases like ClickHouse, Chroma, DataFusion, and RocksDB, keeping only statistically benchmarked optimizations backed by stack-level profiling evidence.
  • helix-inference-opt - LLM inference optimization: applies a fixed 1-minute autoresearch benchmark to Qwen2.5-0.5B decoding on WikiText-2, rewriting only infer.py and keeping throughput gains only when bits-per-byte quality stays within a 1% guard.
  • autoresearch-inference-optimization - Inference serving optimization: lets an agent rewrite serve.sh plus experiment.yaml, benchmark OpenAI-compatible servers under throughput, latency, and memory constraints, and keep only higher-scoring serving configs in experiments.jsonl.
  • PolyTrader - Trading-system performance optimization: applies autoresearch to PolyTrader's signal-detection hot path, keeping only test-clean code changes that cut end-to-end tracker latency from 25.7 ms to 0.46 ms across a published 10-iteration benchmark run.
  • autoresearch-lora-buzhou - Local LoRA fine-tuning optimization: adapts autoresearch to user-chosen LoRA training goals by establishing a confirmed baseline, changing one parameter at a time, rerunning >1% wins for confirmation, and promoting only verified val_loss improvements to the best checkpoint.

Evaluation / Red Teaming

Source file: categories/evaluation-red-teaming.md

  • Claudini - AI safety research: uses an autoresearch-style loop to invent and benchmark new LLM attack algorithms, keeping only methods that outperform baselines.
  • autovoiceevals - Voice AI evaluation: attacks voice agents with adversarial callers, proposes prompt changes one at a time, and keeps or reverts them based on eval results.
  • autoresearch-prompt-optimization - Prompt evaluation: applies the autoresearch loop to a fixed extraction benchmark, iteratively editing one prompt and keeping only accuracy gains on the eval set.
  • We Used Autoresearch on Our AI Skill, It Taught Us to Write Better Tests - AI skill evaluation: runs a prompt-migration skill against six fixed codebase test cases, scores each change on correctness, completeness, and efficiency, and keeps only improvements while cherry-picking around harness overfit.
  • AutoPrompter - Prompt evaluation: combines promptfoo-style metrics with autoresearch-style closed-loop iteration, generating datasets, testing target models, and refining prompts through a persistent experiment ledger.
  • AutonomousTester - UI testing evaluation: adapts autoresearch to Playwright test generation by editing only tests/test_suite.py, measuring coverage_score, and auto-fixing or discarding test changes until coverage improves.
  • Autoresearch for Agents from Scratch - Support-agent prompt evaluation: applies Karpathy's keep/revert loop to system_prompt.md, scoring frozen adversarial support cases by tool-call accuracy and lifting the prompt from 0.05 to 0.80 over 15 experiments.
  • LLM Privacy + Cost Router — Classifier Experiment - Privacy classification evaluation: runs a Karpathy-style autoresearch experiment across regex and prompt variants for a hybrid LLM privacy classifier, validating the best configuration at 96.7% holdout accuracy with 4.6% false negatives.
  • AutoMemory - Agent memory evaluation: lets an agent rewrite its own memory system against LongMemEval, using an immutable evaluator over random question samples and iterating on code plus strategy notes in response to scored failures.
  • How to stop your autoresearch loop from cheating - Autoresearch evaluation hardening: reports 71 experiments across nanochat training and MoE compression, showing loops drift quickly unless experiments are isolated and evaluator gates block shortcut gains (a minimal integrity-guard sketch follows this list).
  • Autoreason - Output evaluation: extends Karpathy-style autoresearch to subjective writing and coding tasks by running incumbent-versus-revision-versus-synthesis tournaments under blind multi-judge Borda scoring and stopping only when the unchanged version wins twice, outperforming standard self-refinement baselines on writing tasks and 150 CodeContests problems.
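
As the loop-cheating entry above stresses, a run is only as trustworthy as its frozen evaluator. A minimal integrity guard might hash the harness before accepting any score; in this sketch all file names are invented for the demo, and a real setup would freeze the actual eval script and validation data before the agent starts.

```python
import hashlib
from pathlib import Path

def digest(paths: list[Path]) -> str:
    """Hash the frozen harness files so tampering is detectable."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(p.read_bytes())
    return h.hexdigest()

# Demo-only stand-in for a real eval script, so this sketch runs end to end.
harness = Path("evaluate_demo.py")
harness.write_text("print(0.5)\n")
locked = [harness]

frozen = digest(locked)  # record once, before the agent starts experimenting

# ...later, before accepting any experiment's score:
if digest(locked) != frozen:
    raise RuntimeError("harness changed: reject the score, not just the diff")
print("harness intact; score may be trusted")
```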

Finance / Trading

Source file: categories/finance-trading.md

  • atlas-gic - Trading: applies Karpathy-style autoresearch to a swarm of market agents, rewriting the worst-performing prompts and keeping changes only when rolling Sharpe improves.
  • autoresearch-trading - Options trading: applies an autoresearch-style keep/revert loop to SPY strategy parameters, logging each experiment against backtest metrics.
  • autoresearch-trading - Trading research: combines Karpathy-style autoresearch with classical optimization so the agent iterates on strategy structure while an optimizer tunes parameters and walk-forward validation decides what survives.
  • BTCautoresearch - Bitcoin forecasting: uses Karpathy-style autoresearch to mutate a single formula file, score walk-forward out-of-sample RMSE, and keep only forecasting rules that beat the baseline power law.
  • autoresearch-skfolio - Portfolio optimization: edits a single portfolio-research script, runs fixed out-of-sample validation across multiple datasets and reversed-return variants, and keeps only Deflated Sharpe Ratio gains.
  • autoresearch-glm - Credit scoring: adapts autoresearch to Taiwan credit-default prediction by editing feature-policy code and keeping only validation AUC gains in a fixed logistic-GLM benchmark.
  • autoresearch-markets - Prediction-market trading research: adapts Karpathy's single-file keep/revert loop to Kalshi data, editing train.py and optimizing val_logloss on held-out resolved markets.
  • Simmer Autoresearch - Prediction-market trading: lets agents mutate skill configs, measure P&L or edge over live trading cycles or historical replays, and auto-commit only the configurations that improve results.
  • Autonomous Trading Strategy Research - Crypto trading research: adapts Karpathy's single-file autoresearch loop to Hyperliquid perpetual futures, backtesting each strategy.py change on fixed historical data and keeping only score improvements across 103 autonomous experiments.
  • PolyEdge AutoResearch - Prediction-market arbitrage: applies a Karpathy-style keep/discard loop to Polymarket Up/Down paper trading, mutating one strategy parameter at a time and scoring each multi-window run on P&L, fill rate, and trading frequency.
  • AutoResearch — Autonomous DEX Strategy Discovery - DEX trading research: applies Karpathy-style autoresearch to Base DEX strategies, backtesting one mutation at a time against real Uniswap V3 and Aerodrome data and lifting composite score from 0.421 to 8.176 over 230+ experiments.
  • Winning the Paradigm Prediction Market Challenge with Claude Code - Prediction-market market making: uses parallel Claude Code agents as an autoresearch swarm to generate 1,039 strategy variants, run 2,000+ evaluations, and optimize mean edge to a first-place finish in Paradigm's challenge.
  • Autoresearch Trading Strategy Optimizer - Crypto trading research: applies Karpathy's autoresearch to one editable strategy.py, hill-climbing on deterministic historical backtests and keeping only commits that improve final_portfolio_value / max_drawdown.
  • Investing Autoresearch - Trading strategy research: uses an autonomous Claude loop to rewrite strategy.py, backtest on held-out market data, and keep only strategies that improve out-of-sample Sharpe under walk-forward, slippage, and fee validation.
  • EMA Crossover Autoresearch - Equity trading research: adapts Karpathy's three-file autoresearch loop to an SBIN EMA strategy, mutating only strategy.py, backtesting a fixed 10-year Indian equities dataset, and keeping only changes that improve a composite return, Sharpe, and drawdown score.

Personal Knowledge / Humanities

Source file: categories/personal-knowledge-humanities.md

  • autoresearch-genealogy - Genealogy: uses Claude Code /autoresearch prompts to expand family trees, verify claims against multiple sources, and keep a structured evidence-backed research vault.
  • claude-obsidian - Personal knowledge: uses a Karpathy-style /autoresearch skill to run multi-round web research with gap-filling and file source-backed concept, entity, and synthesis pages into a compounding Obsidian wiki vault.

Infra / Skills / Forks

Source file: categories/infra-skills-forks.md

  • n-autoresearch - Autoresearch infra: extends Karpathy's loop with structured experiment state, multi-GPU parallelism, adaptive search, and crash recovery.
  • autoresearch-evaluation-harness - Evaluation infrastructure: compares autoresearch-style proposal strategies under fixed task adapters, explicit scalar scoring, and hard keep/discard gates.
  • autoresearch-mlx - Apple Silicon fork: ports Karpathy's autoresearch to MLX while keeping the fixed-time training budget, single mutable file, and git keep/revert loop.
  • Claude Autoresearch - Claude Code skill: generalizes Karpathy's autoresearch into a reusable modify → verify → keep/discard loop for measurable goals beyond ML.
  • claude-autoresearch - Claude Code plugin: runs autoresearch on isolated branches with deterministic verification commands, scheduled overnight sessions, and structured keep/discard reports.
  • lazy-developer - Claude Code plugin suite: runs repeated autoresearch phases across coverage, build speed, test speed, complexity, and performance goals while enforcing per-phase file locks and revert-on-regression behavior.
  • codex-autoresearch - Codex skill: brings the autoresearch pattern to Codex for unattended metric-driven software iteration with automatic keep/discard decisions.
  • gemini-autoresearch - Gemini CLI and Antigravity skill: runs goal-driven overnight improvement loops with verify and guard gates, keeping metric wins and automatically reverting regressions.
  • autoresearch-plugin - Claude Code plugin: packages the Karpathy-style experiment loop into init/test/run commands for projects with explicit evaluation scripts and git rollback.
  • Artificial General Research - Optimization framework: turns measurable code optimization tasks into autoresearch loops with variance-aware acceptance, artifact detection, and exhausted-approach tracking.
  • autoresearch-engram - Memory extension: adds persistent recall, pattern extraction, and reflection steps to Karpathy's autoresearch so the agent remembers what worked across long runs.
  • pi-autoresearch - pi extension: generalizes Karpathy's autoresearch into experiment tools, a live dashboard, and slash-command skills for metric-driven optimization beyond ML.
  • openclaw-autoresearch - OpenClaw plugin: ports pi-autoresearch to OpenClaw with pending-run enforcement, confidence scoring, checkpoint files, and git-backed keep/discard semantics.
  • autoresearch-opencode - OpenCode skill: ports pi-autoresearch into OpenCode as a pure skill that logs JSONL experiment runs and resumes autonomous keep/discard loops with built-in tools.
  • pi-autoresearch-studio - pi control plane: adds TUI and web dashboards, plan and ideas editing, and selective PR creation on top of pi-autoresearch sessions.
  • autoresearch-gen - Scaffold generator: interviews the user, generates a verified autoresearch experiment scaffold, auto-runs the baseline, and repairs broken generated code before handoff to the agent.
  • autoresearch-autoresearch - Meta-autoresearch repo: maintains a portable canonical loop distilled from karpathy/autoresearch and adjacent systems so new evidence can update a reusable agent-verifier architecture across domains.
  • Bilevel Autoresearch - Meta-autoresearch framework: adds outer loops that rewrite autoresearch search mechanisms themselves and reports multi-run gains on Karpathy's training benchmark.
  • SkyPilot parallel autoresearch - GPU infrastructure: gives Karpathy's autoresearch access to 16 GPUs so the agent can run parallel experiment waves, validate winners on faster hardware, and reach about 910 runs in about 8 hours.
  • autoresearch_deeplake_swarm - Cloud swarm infrastructure: extends Karpathy's loop with Modal-powered parallel workers and a shared Deeplake experiment notebook so multiple agents can explore train.py concurrently and surface only the best surviving commits.
  • Autoresearch on Red Hat OpenShift AI - Kubernetes ML infrastructure: runs Karpathy's autoresearch as a 24-hour OpenShift AI workload, packaging nanochat into containers that logged 198 experiments and improved validation loss by 2.3% without human intervention.
  • serverless-autoresearch - SageMaker infrastructure: parallelizes Karpathy's autoresearch on Spot training jobs so the agent evaluates train.py candidates with HUGI-style burst compute instead of paying for idle GPUs.
  • autoresearch-win-rtx - Windows GPU fork: ports Karpathy's single-file, 5-minute, val_bpb keep/discard loop to native Windows on consumer RTX GPUs.
  • autoresearch-amd - AMD GPU fork: ports Karpathy's single-file, 5-minute val_bpb keep/discard loop to ROCm by replacing Flash Attention 3 with portable SDPA for RDNA 4 cards.
  • autoloop - Agent runtime: generalizes Karpathy's autoresearch into bounded repo-level loops with inferred eval commands, explicit guardrails, and keep/discard decisions across multiple coding agents.
  • GOAL.md - Goal-spec framework: generalizes Karpathy's autoresearch to repos without native scalar metrics by constructing a project-specific fitness function in GOAL.md, then running measure → act → verify → keep/revert loops against it.
  • autoresearch-claude-code - Claude Code plugin: ports pi-autoresearch into a pure plugin skill with JSONL state, slash-command control, and autonomous keep/discard loops for arbitrary METRIC-based benchmarks.
  • autoresearch-benchmark - Benchmarking infrastructure: compares four autoresearch-style tools on the same sorting-throughput task and records both performance gains and iteration behavior under a shared setup.
  • CORAL - Multi-agent autoresearch infrastructure: runs Claude Code, Codex, or OpenCode workers in isolated worktrees, grades each attempt with coral eval, and keeps scored improvements while sharing notes and skills across agents.
  • autoresearch for agents - Agent evaluation template: adapts autoresearch to agent.py plus fixed run_eval.py and dataset.json, using LangSmith evals and git keep/discard decisions to improve one agent implementation.
  • autoresearch-automl - Benchmarking research: compares nine classical, LLM-based, and hybrid optimizers on Karpathy's nanochat task under a shared 24-hour budget, showing code-editing autoresearch is competitive but fixed-space classical HPO still wins.
  • autoresearch-anycloud - Cloud GPU infrastructure: wraps Karpathy's autoresearch in a unified Mac and cloud runner with platform setup, budget watchdogs, result collection, and automatic teardown across AWS, GCP, Azure, and OCI.
  • skill-autoresearch for Hermes Agent - Hermes skill: optimizes prompts, scripts, and validators through baseline → diagnose → patch → re-evaluate → keep/revert loops with dependency checks and conservative holdout rules.
  • autoresearch-anything - Claude Code skill: scaffolds Karpathy-style autoresearch pipelines for measurable business metrics by generating connectors, persistence setup, and deploy → measure → keep/discard loops around API-observable outcomes.
  • AutoSkill - Skill prompt optimization framework: applies Karpathy's keep/revert loop to SKILL.md, mutating one prompt at a time against test cases and improving an auto-reminder skill from 45% to 90% reliability over 60+ autonomous iterations.
  • EvoSkill - Skill-evolution framework: analyzes failed coding-agent trajectories, proposes skill or prompt changes, evaluates them against benchmarks, and keeps only better agent variants in a Karpathy-style self-improvement loop.
  • Skill Forge v2 - Skill and code optimization framework: adapts Karpathy's autoresearch to SKILL.md files and generic codebases, using dry-run validation, objective deltas, and keep/revert thresholds to steer autonomous or guided experiment loops.
  • autoimprove-cc - Claude Code skill optimizer: applies a Karpathy-style autoresearch loop to SKILL.md, scoring binary assertions from eval.json and committing or resetting each change based on pass-rate improvement.
  • ehmo/autoresearch-skill - Claude Code and Codex skill: generalizes autoresearch into clean-room red, green, and refactor teams that iteratively find issues, fix them under test, and simplify code on a feature branch while the coordinator keeps only verified progress.
  • ResearcherSkill - Claude Code and Codex skill: generalizes autoresearch into git-backed .lab/ sessions with branching experiment trees, convergence detection, and commit/revert control, improving Yggdrasil agent rules from 1.82 to 7.04 in a published loop.
  • Litmus - Parallel ML research infrastructure: turns OpenClaw into a multi-agent autoresearch lab with branch-isolated workers, scheduled director and synthesizer roles, and keep/revert experiment commits plus shared discoveries and skills.
  • Autoresearch CLI - Cross-agent experiment infrastructure: packages Karpathy's one-file, one-metric keep/revert loop as a Rust CLI that scaffolds configs, validates eval commands, records JSONL results, and installs slash-command skills into multiple coding agents.
  • codex-autoresearcher - Codex experiment infrastructure: runs optimization campaigns through separate worker and judge Codex processes, a static evaluate.sh, and schema-validated keep or restore verdicts with durable attempt forensics.
  • ExAutoresearch - Elixir autoresearch framework: hot-loads one experiment module at a time, trains GPT variants under fixed budgets across distributed GPU nodes, and uses a referee plus dashboard to early-stop losers and persist the best surviving trials.
  • slowresearch - Delayed-feedback experiment skill: adapts autoresearch to content, outreach, pricing, and other publish-and-wait workflows by logging human-reported metrics and proposing the next hypothesis across long feedback cycles.
  • AutoAgent - Agent-engineering infrastructure: applies Karpathy's autoresearch to a single-file Harbor agent harness, rewriting agent.py, benchmarking scored tasks, and keeping only prompt, tool, or orchestration changes that raise total score.
  • VibeHQ - Multi-agent coordination infrastructure: applies autoresearch to team protocol design by benchmarking agent swarms, analyzing failure logs, rewriting hub code via /optimize-protocol, and iterating until coordination flags and token waste fall.
  • helix - Agent-agnostic autoresearch infrastructure: generalizes Karpathy's loop into reproducible helix.yaml + program.md repos with backend-swappable agents, append-only experiments.tsv ledgers, and independently verifiable example helices.
  • Autolab Companion Tools - Autoresearch companion infrastructure: adds statistical keep/discard verdicts, experiment-history steering, and multi-agent branch competitions to Karpathy's GPT-pretraining loop through the autojudge, autosteer, and autoevolve CLIs.
  • autoresearch-cpu - CPU ML fork: ports Karpathy's autoresearch to commodity CPUs by replacing Flash Attention with native SDPA, shrinking defaults for 30-minute local runs, and preserving the same one-file val_bpb keep/discard loop without CUDA.
  • hugoferreira/autoresearch - Codebase research framework: generalizes Karpathy's loop into falsifiable hypotheses, isolated experiment worktrees, instrument-backed observations, strict gate review, and reusable lessons for measurable engineering goals.

Related Practices / Discussions

Source file: categories/related-practices-discussions.md

Trading / markets

Business / GTM workflows

Workflow automation / consumer ops

Prompt / evaluation

Software / code workflows

Scientific / research augmentation

Infra / benchmarking ideas

Knowledge Base / RAG Preparation

Source file: categories/knowledge-base-rag-preparation.md

  • autoresearch-genealogy - Genealogy: uses Claude Code /autoresearch prompts to expand family trees, verify claims against multiple sources, and keep a structured evidence-backed research vault.
  • AutoRAGsearch - RAG retrieval optimization: applies an autoresearch-style loop to a fixed QA benchmark by editing only rag_pipeline.py, running local retrieval experiments, and improving retrieval_score from 0.9472 to 0.9867 over 20 autonomous experiments.
  • claude-obsidian - Knowledge-base preparation: uses a Karpathy-style /autoresearch skill to gather sources, fill research gaps across three search rounds, and write structured source, concept, entity, and synthesis pages into a retrieval-ready Obsidian vault.

Market Research

Source file: categories/market-research.md

  • atlas-gic - Trading: applies Karpathy-style autoresearch to a swarm of market agents, rewriting the worst-performing prompts and keeping changes only when rolling Sharpe improves.

Workflow Automation

Source file: categories/workflow-automation.md

  • AutoResearchClaw - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
  • Claudini - AI safety research: uses an autoresearch-style loop to invent and benchmark new LLM attack algorithms, keeping only methods that outperform baselines.
  • AutoKernel - GPU optimization: applies Karpathy-style autoresearch to kernel bottlenecks, iterating on code, benchmarking, and keeping only changes that improve speed without breaking correctness.
  • autovoiceevals - Voice AI evaluation: attacks voice agents with adversarial callers, proposes prompt changes one at a time, and keeps or reverts them based on eval results.
  • PM document optimizer - Product workflow automation: applies a Karpathy-style git ratchet to markdown artifacts like PRDs and strategy docs, scoring each draft with programmatic checks and committing only higher-scoring revisions.
  • Trip Optimizer Pro - Travel planning workflow automation: applies the autoresearch pattern to itinerary generation by researching destinations, scoring multi-day plans, and keeping only itinerary mutations that improve a weighted travel-quality score.

Submission format

Use exactly one line per entry:

- [Name](URL) - Industry: one-sentence description of the autoresearch use case.
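
For example, a hypothetical entry (name, URL, and details invented purely for illustration) would read:

- [toy-autoresearch](https://github.com/example/toy-autoresearch) - Vision research: applies a keep/discard loop to CIFAR-10 training by editing one train.py and accepting only validation-accuracy gains.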

How to contribute

  1. Pick the category that best matches the direct autoresearch application domain.
  2. Add a single-line entry in the required format to the category file, not directly to the README aggregate.
  3. Keep the summary concrete and scannable.
  4. Prefer examples that clearly show scenario + autoresearch loop + value.

See CONTRIBUTING.md for details.

License

MIT
