awesome-autoresearch

A curated list of public autoresearch use cases across industries.

This README aggregates the current category files, so the latest accepted entries are visible here without drilling into subpages.

The repository distinguishes between:

  • primary categories, which hold the strongest case evidence: repos, project pages, and concrete write-ups
  • secondary overlap categories, which view the same evidence from another angle to surface cross-cutting patterns
  • Related Practices / Discussions, which collects credible public practice signals (especially X threads, Reddit discussions, and interviews) that describe real autoresearch usage even when no strong standalone case page exists yet

Why this list

Most discussions about autoresearch are still scattered, vague, or overly tool-centric. This list is designed to answer two practical questions quickly:

  • Where has autoresearch already been used in real workflows?
  • Which patterns can transfer across industries?

This is not a comprehensive database. It is a high-signal, fast-scanning field guide.

Inclusion criteria

An entry should meet all of the following:

  • The source is public and citable.
  • The example is directly related to autoresearch, not just a generic research or monitoring agent.
  • The source explicitly mentions autoresearch, cites Karpathy's autoresearch, or clearly shows a modify → verify → keep/discard → repeat loop (a minimal code sketch of this loop appears below).
  • The summary explains the scenario, method, and value in one sentence.

We do not include:

  • Generic research agents, monitoring agents, or multi-agent systems with no explicit autoresearch loop.
  • Pure theory or opinion without a concrete practice.
  • Generic AI commentary with no autoresearch workflow.
  • Long write-ups inside the list itself.
  • Sources that are private, inaccessible, or too vague to classify.
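
To make the loop criterion concrete, here is a minimal, self-contained sketch of the modify → verify → keep/discard → repeat cycle. Everything in it is illustrative, not taken from any listed project: the toy objective and single mutable parameter stand in for a project's fixed training-and-evaluation harness, and the incumbent/candidate bookkeeping stands in for git commit and revert.

```python
import random

def run_experiment(params: dict) -> float:
    """Stand-in for one fixed-budget experiment scored by a single metric
    (e.g. val_bpb or AUC); here, a noisy toy objective peaking at x = 3."""
    return -(params["x"] - 3.0) ** 2 + random.gauss(0.0, 0.01)

def propose(params: dict) -> dict:
    """Mutate exactly one knob per iteration, as most listed loops do."""
    candidate = dict(params)
    candidate["x"] += random.uniform(-0.5, 0.5)
    return candidate

incumbent = {"x": 0.0}
best = run_experiment(incumbent)      # a baseline must exist before iterating
for _ in range(50):
    candidate = propose(incumbent)
    score = run_experiment(candidate)
    if score > best:                  # keep: the candidate becomes the incumbent
        incumbent, best = candidate, score
    # else: discard the change and try another mutation

print(f"best params: {incumbent}, best score: {best:.4f}")
```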

Current coverage

Primary categories

  • Scientific Research
  • Software / Systems Optimization
  • Evaluation / Red Teaming
  • Finance / Trading
  • Personal Knowledge / Humanities
  • Infra / Skills / Forks

Secondary overlap categories

  • Knowledge Base / RAG Preparation
  • Market Research
  • Workflow Automation

Open categories still being tracked

Some entries intentionally appear in more than one overlap category when the same project is both a domain case and a reusable workflow pattern.

Browse by category

Full list

Scientific Research

Source file: categories/scientific-research.md

  • AutoResearchClaw - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
  • Sibyl Research System - Scientific research: builds a fully autonomous AI scientist on Claude Code with inner research-iteration loops and outer self-evolution across projects.
  • autoresearch-rl - RL research: applies the autoresearch pattern to RL post-training by iterating on one training config, running fixed-time experiments, and keeping only eval improvements.
  • autoresearch-robotics - Robotics research: adapts Karpathy-style autoresearch to MuJoCo and Gymnasium robotics tasks by editing one training file, evaluating fixed-budget runs, and using simulator renderings plus vision feedback to keep only better policies.
  • Tinker-Explorer - Evidence-retrieval research: adapts the autoresearch pattern to GRPO document exploration, comparing reward designs and keeping only retrieval policies that answer multihop questions more accurately under a token budget.
  • Autoresearch on an old research idea - Multimodal retrieval research: applies Claude Code autoresearch to an old eCLIP idea, running 42 fixed-budget experiments with commit/revert decisions and cutting mean rank from 344.68 to 157.43.
  • autoresearch-at-home - Distributed ML research: coordinates a SETI@home-style swarm of agents that claim experiments, share full train.py results through Ensue, and collectively drive down val_bpb across different GPUs.
  • autoresearch-paper-benchmark - Graph ML research: runs paper-driven campaigns on a fixed Peptides-func benchmark by editing train.py, logging 300-second experiments, and testing only the best validation-AP model at campaign end.
  • autoresearch-cifar10 - Vision research: applies autoresearch to CIFAR-10 ResNet training on a 3090, iterating under fixed time budgets and keeping changes that lift accuracy beyond a 91.89% baseline.
  • AutoResearch-GenPose - Vision research: adapts autoresearch to CIFAR-10 UNet denoising by editing one training file, running fixed 5-minute experiments, and keeping only val_psnr improvements.
  • MLP-AutoResearch - MNIST training research: ports Karpathy's single-file loop to an MLP classifier, fixing 20-epoch runs and greedy keep/revert decisions that raised handwritten-digit accuracy from 0.9809 to 0.9836.
  • autoresearch-medimage - Medical imaging research: adapts Karpathy's prepare.py + train.py + results.tsv loop to 2D imaging tasks, using short-budget candidate discovery and staged follow-up validation to surface stronger ChestXray14 models.
  • autocircuit - Analog circuit optimization: adapts Karpathy's autoresearch to a SKY130 two-stage op-amp, editing optimize.py, running ngspice, and keeping only parameter changes that expand the GBW-versus-power Pareto front under phase-margin constraints.
  • fe-autoresearch - Tabular ML research: applies the autoresearch loop to LightGBM feature engineering on the UCI Bank Marketing dataset by editing one engineer_features() target, training against fixed AUC metrics, and keeping only improvements.
  • Paper Lantern improves Autoresearch - ML research augmentation: connects a 2M-paper MCP server to autoresearch, letting the agent cite 100 papers across 100 experiments and reach a 3.2% lower 2-hour validation loss than the same run without paper access.
  • Subtractive Search in a Mature Tabular Pipeline - Tabular ML research: applies Karpathy's autoresearch to a churn-prediction XGBoost pipeline, running 116 autonomous experiments and lifting subsample AUC from 0.902892 to 0.916721 largely by removing noisy target-encoded features.
  • autoresearch-connect4 - Game AI research: adapts Karpathy's three-file autoresearch loop to Connect Four by editing train.py, training 5-minute self-play runs, and keeping only changes that improve weighted win rate against fixed opponents.
  • autoresearch-tabular - Tabular ML research: adapts Karpathy's three-file loop to the Adult Income benchmark by editing only train.py, running fixed 2-minute experiments, and keeping only val_auc improvements.
  • ocr-autoresearch - OCR research: adapts autoresearch to ICDAR2015 scene-text recognition by editing one train.py, running fixed 5-minute CRNN+CTC experiments, and keeping only lower validation character error rates.
  • Tennis XGBoost Autoresearch - Sports analytics research: applies a Karpathy-style keep/revert loop to a 245K-match tennis XGBoost pipeline, then hardens the evaluator after the agent learned to game mutable ROC-AUC scoring.
  • Bio-Autoresearch - Drug discovery research: applies a Karpathy-style autoresearch loop to rare-disease drug repurposing on PrimeKG, running 15 GPU experiments with keep/revert decisions and lifting held-out per-disease AUPRC from 0.284 to 0.761.
  • autoresearch-quantum - Quantum research: runs incumbent/challenger autoresearch ratchets for encoded magic-state experiments, screens candidates on cheap noisy simulations, and promotes only justified challengers to expensive backends while logging transferable lessons.
  • kaggle-autoresearch - Tabular ML research: adapts Karpathy-style autoresearch to Kaggle competitions such as Titanic, House Prices, and Store Sales by iterating on feature and model code, logging approved baselines, and accepting only cross-validation improvements over fixed thresholds.
  • MiniMax M2.7: Early Echoes of Self-Evolution - AI-lab research: describes an internal research agent that automates 30%-50% of its RL workflow, plus a 100+ round keep/revert scaffold-optimization loop that improved internal evaluation scores by 30%.
  • autoresearch-macro - Macroeconomic forecasting research: runs LLM-guided outer-loop search over Chronos-2 covariates, transforms, and fine-tuning settings, keeping only validation-era forecast improvements across pseudo-real-time Norway, Canada, and Sweden benchmarks.
  • autoresearch-dqn - RL algorithm research: applies the autoresearch loop to a CartPole training script, logging 39 iterations that replaced an unstable DQN baseline with a REINFORCE agent that reaches reward 500 in about 5 seconds instead of about 3 minutes.
  • AutoMedal - Kaggle competition research: adapts Karpathy's keep/revert loop into strategist, researcher, and experimenter phases, journaling 24 tabular-competition experiments and keeping only lower val_loss changes on a fixed leaderboard-oriented harness.
  • autoresearch-qwen - Document VQA research: adapts Karpathy's keep/discard loop to Qwen3-VL on the official DocVQA benchmark by fixing evaluate.py, limiting edits to train.py, and accepting only higher full-validation ANLS scores.

Software / Systems Optimization

Source file: categories/software-systems-optimization.md

  • karpathy/autoresearch - ML training optimization: the original autoresearch loop where an agent edits a GPT training script, runs fixed-time experiments, and keeps only improvements in validation bits-per-byte.
  • AutoKernel - GPU optimization: applies Karpathy-style autoresearch to kernel bottlenecks, iterating on code, benchmarking, and keeping only changes that improve speed without breaking correctness.
  • autoresearch-webgpu - Browser ML optimization: ports Karpathy's autoresearch into the browser so agents can generate training code, run GPU-backed experiments, and feed losses back into the next iteration.
  • autoresearch-local-llm - Local ML optimization: replaces Claude Code with a local Qwen model to run the standard autoresearch keep/revert loop on a shared single GPU.
  • Shopify Liquid performance work via autoresearch - Software optimization: Tobi Lütke applied an autoresearch loop to Shopify's Liquid template engine, producing 93 automated commits that improved parse+render performance by 53% with 61% fewer allocations.
  • Autoresearch for SAT Solvers - SAT solver optimization: runs parallel MaxSAT experiments, updates reusable solver code plus expert memory, and improves public benchmark configurations against 2024 competition baselines.
  • autoresearch — Heuristic CP Edition - Heuristic solver optimization: adapts autoresearch to C++ competitive-programming solvers by editing only solver.cpp, scoring fixed benchmark instances, and keeping only lower average solution costs.
  • Autoresearch for game development - HTML5 game development: runs agents that improve games based on player feedback and usage metrics, benchmarking them with game Elo scores from head-to-head matchups.
  • SiliconSwarm@Ensue - Apple Silicon inference optimization: uses a multi-agent autoresearch loop to test ANE graph changes across chips and reports up to 6.31× lower median DistilBERT latency than CoreML.
  • Flash-MoE - Apple Silicon inference optimization: uses a Claude Code autoresearch loop to run 43 Metal optimization experiments on Qwen3.5-397B and reach 20.34 tok/s on an M5 Max by overlapping SSD reads with GPU compute.
  • Research-Driven Agents: When an agent reads before it codes - LLM inference optimization: extends Karpathy's autoresearch with a literature-review phase that reads papers and competing forks before parallel llama.cpp experiments, landing five kernel fusions and about 15% faster x86 flash-attention generation in about 3 hours.
  • Rails controller tuning with Claude Code /loop autoresearch - Backend performance optimization: adapts Karpathy's keep/discard loop to Rails controller latency by locking benchmark scripts and test data, running 10-minute cycles, and auto-reverting regressions.
  • Pytest speedups via autoresearch feedback loops - Test performance optimization: applies autoresearch to a backend pytest suite with a fixed evaluation harness, seven autonomous experiments, and a 295s → 71s keep/discard improvement path.
  • autoresearch-sudoku - Solver optimization: uses an enhanced autoresearch loop to rewrite a Rust sudoku solver over 312 experiments and beat Tdoku plus rust_sudoku on 4 of 6 standard benchmark datasets.
  • autospec - Backend service generation: applies an autoresearch-inspired keep-or-revert loop to natural-language business rules, iteratively building a Spring Boot service until Gradle and JUnit evaluation pass without regression.
  • How I used autoresearch to fix Gumroad's flaky tests in a week - Test reliability: uses OpenClaw autoresearch to run 206 commits and 94 CI cycles that fixed 13 flaky tests while surfacing a real file-ID remapping bug.
  • WinMoE - Windows inference optimization: uses an AI-driven autoresearch methodology with one-change measurements and keep-or-reject ledgers to lift Qwen3.5-397B throughput from 0.44 to 1.9 tok/s on consumer hardware.
  • ZK Autoresearch — Plonky3 DFT Optimizer - ZK prover optimization: applies Karpathy's autoresearch pattern to Plonky3's DFT code, running Rust tests plus Criterion benchmarks and keeping only commits that reduce coset_lde_batch time on BabyBear field workloads.
  • autoresearch-go-ane - Apple Silicon training optimization: ports Karpathy's loop to a Go plus ANE LLM trainer, benchmarking fixed 5-minute TinyStories runs with benchstat and keeping only lower val_loss configurations.
  • openroad-autoresearch-ibex - Chip design optimization: applies a fixed-harness autoresearch loop to OpenROAD RTL-to-GDSII experiments on the IBEX CPU, using scout-promote screening and objective-aware history to keep only timing, area, or power improvements.
  • OpenCLI - Browser automation reliability: adds a Karpathy-style autoresearch harness to OpenCLI, cycling review → modify → commit → verify → decide against fixed V2EX, Zhihu, browser, and save-as-CLI eval suites to keep only reliability improvements.
  • autoresearch-cublas-sam3 - GPU kernel optimization: applies an autoresearch loop to SAM3 GEMM tuning by mutating one config at a time, benchmarking on real GPUs, and keeping only changes that improved throughput by 2.14% over 120 experiments on an RTX 3090.
  • autoresearch-mamba - Mamba training optimization: adapts Karpathy's fixed-evaluator, 5-minute keep/discard loop to MLX Mamba-2, Mamba-3, and hybrid Mamba-Transformer models on Apple Silicon by mutating one training surface to lower val_bpb.
  • liltrAIner - Local LLM fine-tuning optimization: applies a Karpathy-style autoresearch loop to MLX LoRA runs on Apple Silicon, letting an agent mutate training data or config, score eval prompts, and keep or revert each fine-tuning experiment.
  • english-app - Education app optimization: applies an autoresearch-inspired proposer → implement → test → evaluate → keep/discard loop to an English learning app, using pytest, TypeScript checks, and smoke tests to keep only changes scoring at least 6.0 across 10 autonomous iterations.
  • How we built the best browser agent with Auto-Research - Browser automation optimization: uses parallel Claude Code auto-research loops against Online-Mind2Web, running 20-cycle harness edits with train/validation splits and reaching 97% on the benchmark while rejecting task-specific overfits.
  • Speed up code with pi-autoresearch - Software performance optimization: applies pi-autoresearch to jsonista's JSON decoding benchmark, keeping only measured wins and lifting one selected benchmark's throughput by 56% while surfacing overfitting risks in accepted diffs.
  • 588x Faster SQLite Ingestion With an Autoresearch Loop - ETL performance optimization: applies pi-autoresearch to a Python financial-data ingestion pipeline, benchmarking 50,000-row SQLite writes and keeping fixes that cut processing time from about 397s to 0.675s.
  • nnmetal + labrat - Apple Silicon inference optimization: uses an autonomous Zig and Metal autoresearch loop that snapshots engine files, makes one kernel change at a time, runs compile, test, and benchmark gates, and commits only throughput or latency wins above a fixed threshold.
  • HashSmith, Part 3: I Automated My Way to a 27% Faster Hash Table - Data-structure performance optimization: uses a Claude Code auto-optimize loop to profile, benchmark, and keep only wins on a JVM SwissTable implementation, landing three accepted changes and 13%-32% gains across eight benchmark scenarios.
  • claude-code-bench - AI coding workflow optimization: applies Karpathy-style autoresearch to Claude Code's 7-dimensional configuration space, running benchmark tasks and keeping only profiles that improve quality-adjusted scores for research depth, correctness, and convention adherence.
  • autooptimization - Systems optimization: applies a profile-first autoresearch protocol to codebases like ClickHouse, Chroma, DataFusion, and RocksDB, keeping only statistically benchmarked optimizations backed by stack-level profiling evidence.
  • helix-inference-opt - LLM inference optimization: applies a fixed 1-minute autoresearch benchmark to Qwen2.5-0.5B decoding on WikiText-2, rewriting only infer.py and keeping throughput gains only when bits-per-byte quality stays within a 1% guard.
  • autoresearch-inference-optimization - Inference serving optimization: lets an agent rewrite serve.sh plus experiment.yaml, benchmark OpenAI-compatible servers under throughput, latency, and memory constraints, and keep only higher-scoring serving configs in experiments.jsonl.
  • PolyTrader - Trading-system performance optimization: applies autoresearch to PolyTrader's signal-detection hot path, keeping only test-clean code changes that cut end-to-end tracker latency from 25.7 ms to 0.46 ms across a published 10-iteration benchmark run.
  • autoresearch-lora-buzhou - Local LoRA fine-tuning optimization: adapts autoresearch to user-chosen LoRA training goals by establishing a confirmed baseline, changing one parameter at a time, rerunning >1% wins for confirmation, and promoting only verified val_loss improvements to the best checkpoint.

Evaluation / Red Teaming

Source file: categories/evaluation-red-teaming.md

  • Claudini - AI safety research: uses an autoresearch-style loop to invent and benchmark new LLM attack algorithms, keeping only methods that outperform baselines.
  • autovoiceevals - Voice AI evaluation: attacks voice agents with adversarial callers, proposes prompt changes one at a time, and keeps or reverts them based on eval results.
  • autoresearch-prompt-optimization - Prompt evaluation: applies the autoresearch loop to a fixed extraction benchmark, iteratively editing one prompt and keeping only accuracy gains on the eval set.
  • We Used Autoresearch on Our AI Skill, It Taught Us to Write Better Tests - AI skill evaluation: runs a prompt-migration skill against six fixed codebase test cases, scores each change on correctness, completeness, and efficiency, and keeps only improvements while cherry-picking around harness overfit.
  • AutoPrompter - Prompt evaluation: combines promptfoo-style metrics with autoresearch-style closed-loop iteration, generating datasets, testing target models, and refining prompts through a persistent experiment ledger.
  • AutonomousTester - UI testing evaluation: adapts autoresearch to Playwright test generation by editing only tests/test_suite.py, measuring coverage_score, and auto-fixing or discarding test changes until coverage improves.
  • Autoresearch for Agents from Scratch - Support-agent prompt evaluation: applies Karpathy's keep/revert loop to system_prompt.md, scoring frozen adversarial support cases by tool-call accuracy and lifting the prompt from 0.05 to 0.80 over 15 experiments.
  • LLM Privacy + Cost Router — Classifier Experiment - Privacy classification evaluation: runs a Karpathy-style autoresearch experiment across regex and prompt variants for a hybrid LLM privacy classifier, validating the best configuration at 96.7% holdout accuracy with 4.6% false negatives.
  • AutoMemory - Agent memory evaluation: lets an agent rewrite its own memory system against LongMemEval, using an immutable evaluator over random question samples and iterating on code plus strategy notes in response to scored failures.
  • How to stop your autoresearch loop from cheating - Autoresearch evaluation hardening: reports 71 experiments across nanochat training and MoE compression, showing loops drift quickly unless experiments are isolated and evaluator gates block shortcut gains (a minimal integrity-guard sketch follows this list).
  • Autoreason - Output evaluation: extends Karpathy-style autoresearch to subjective writing and coding tasks by running incumbent-versus-revision-versus-synthesis tournaments under blind multi-judge Borda scoring and stopping only when the unchanged version wins twice, outperforming standard self-refinement baselines on writing tasks and 150 CodeContests problems.
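
As the loop-cheating entry above stresses, a run is only as trustworthy as its frozen evaluator. A minimal integrity guard might hash the harness before accepting any score; in this sketch all file names are invented for the demo, and a real setup would freeze the actual eval script and validation data before the agent starts.

```python
import hashlib
from pathlib import Path

def digest(paths: list[Path]) -> str:
    """Hash the frozen harness files so tampering is detectable."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(p.read_bytes())
    return h.hexdigest()

# Demo-only stand-in for a real eval script, so this sketch runs end to end.
harness = Path("evaluate_demo.py")
harness.write_text("print(0.5)\n")
locked = [harness]

frozen = digest(locked)  # record once, before the agent starts experimenting

# ...later, before accepting any experiment's score:
if digest(locked) != frozen:
    raise RuntimeError("harness changed: reject the score, not just the diff")
print("harness intact; score may be trusted")
```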

Finance / Trading

Source file: categories/finance-trading.md

  • atlas-gic - Trading: applies Karpathy-style autoresearch to a swarm of market agents, rewriting the worst-performing prompts and keeping changes only when rolling Sharpe improves.
  • autoresearch-trading - Options trading: applies an autoresearch-style keep/revert loop to SPY strategy parameters, logging each experiment against backtest metrics.
  • autoresearch-trading - Trading research: combines Karpathy-style autoresearch with classical optimization so the agent iterates on strategy structure while an optimizer tunes parameters and walk-forward validation decides what survives.
  • BTCautoresearch - Bitcoin forecasting: uses Karpathy-style autoresearch to mutate a single formula file, score walk-forward out-of-sample RMSE, and keep only forecasting rules that beat the baseline power law.
  • autoresearch-skfolio - Portfolio optimization: edits a single portfolio-research script, runs fixed out-of-sample validation across multiple datasets and reversed-return variants, and keeps only Deflated Sharpe Ratio gains.
  • autoresearch-glm - Credit scoring: adapts autoresearch to Taiwan credit-default prediction by editing feature-policy code and keeping only validation AUC gains in a fixed logistic-GLM benchmark.
  • autoresearch-markets - Prediction-market trading research: adapts Karpathy's single-file keep/revert loop to Kalshi data, editing train.py and optimizing val_logloss on held-out resolved markets.
  • Simmer Autoresearch - Prediction-market trading: lets agents mutate skill configs, measure P&L or edge over live trading cycles or historical replays, and auto-commit only the configurations that improve results.
  • Autonomous Trading Strategy Research - Crypto trading research: adapts Karpathy's single-file autoresearch loop to Hyperliquid perpetual futures, backtesting each strategy.py change on fixed historical data and keeping only score improvements across 103 autonomous experiments.
  • PolyEdge AutoResearch - Prediction-market arbitrage: applies a Karpathy-style keep/discard loop to Polymarket Up/Down paper trading, mutating one strategy parameter at a time and scoring each multi-window run on P&L, fill rate, and trading frequency.
  • AutoResearch — Autonomous DEX Strategy Discovery - DEX trading research: applies Karpathy-style autoresearch to Base DEX strategies, backtesting one mutation at a time against real Uniswap V3 and Aerodrome data and lifting composite score from 0.421 to 8.176 over 230+ experiments.
  • Winning the Paradigm Prediction Market Challenge with Claude Code - Prediction-market market making: uses parallel Claude Code agents as an autoresearch swarm to generate 1,039 strategy variants, run 2,000+ evaluations, and optimize mean edge to a first-place finish in Paradigm's challenge.
  • Autoresearch Trading Strategy Optimizer - Crypto trading research: applies Karpathy's autoresearch to one editable strategy.py, hill-climbing on deterministic historical backtests and keeping only commits that improve final_portfolio_value / max_drawdown.
  • Investing Autoresearch - Trading strategy research: uses an autonomous Claude loop to rewrite strategy.py, backtest on held-out market data, and keep only strategies that improve out-of-sample Sharpe under walk-forward, slippage, and fee validation.
  • EMA Crossover Autoresearch - Equity trading research: adapts Karpathy's three-file autoresearch loop to an SBIN EMA strategy, mutating only strategy.py, backtesting a fixed 10-year Indian equities dataset, and keeping only changes that improve a composite return, Sharpe, and drawdown score.

Personal Knowledge / Humanities

Source file: categories/personal-knowledge-humanities.md

  • autoresearch-genealogy - Genealogy: uses Claude Code /autoresearch prompts to expand family trees, verify claims against multiple sources, and keep a structured evidence-backed research vault.
  • claude-obsidian - Personal knowledge: uses a Karpathy-style /autoresearch skill to run multi-round web research with gap-filling and file source-backed concept, entity, and synthesis pages into a compounding Obsidian wiki vault.

Infra / Skills / Forks

Source file: categories/infra-skills-forks.md

  • n-autoresearch - Autoresearch infra: extends Karpathy's loop with structured experiment state, multi-GPU parallelism, adaptive search, and crash recovery.
  • autoresearch-evaluation-harness - Evaluation infrastructure: compares autoresearch-style proposal strategies under fixed task adapters, explicit scalar scoring, and hard keep/discard gates.
  • autoresearch-mlx - Apple Silicon fork: ports Karpathy's autoresearch to MLX while keeping the fixed-time training budget, single mutable file, and git keep/revert loop.
  • Claude Autoresearch - Claude Code skill: generalizes Karpathy's autoresearch into a reusable modify → verify → keep/discard loop for measurable goals beyond ML.
  • claude-autoresearch - Claude Code plugin: runs autoresearch on isolated branches with deterministic verification commands, scheduled overnight sessions, and structured keep/discard reports.
  • lazy-developer - Claude Code plugin suite: runs repeated autoresearch phases across coverage, build speed, test speed, complexity, and performance goals while enforcing per-phase file locks and revert-on-regression behavior.
  • codex-autoresearch - Codex skill: brings the autoresearch pattern to Codex for unattended metric-driven software iteration with automatic keep/discard decisions.
  • gemini-autoresearch - Gemini CLI and Antigravity skill: runs goal-driven overnight improvement loops with verify and guard gates, keeping metric wins and automatically reverting regressions.
  • autoresearch-plugin - Claude Code plugin: packages the Karpathy-style experiment loop into init/test/run commands for projects with explicit evaluation scripts and git rollback.
  • Artificial General Research - Optimization framework: turns measurable code optimization tasks into autoresearch loops with variance-aware acceptance, artifact detection, and exhausted-approach tracking.
  • autoresearch-engram - Memory extension: adds persistent recall, pattern extraction, and reflection steps to Karpathy's autoresearch so the agent remembers what worked across long runs.
  • pi-autoresearch - pi extension: generalizes Karpathy's autoresearch into experiment tools, a live dashboard, and slash-command skills for metric-driven optimization beyond ML.
  • openclaw-autoresearch - OpenClaw plugin: ports pi-autoresearch to OpenClaw with pending-run enforcement, confidence scoring, checkpoint files, and git-backed keep/discard semantics.
  • autoresearch-opencode - OpenCode skill: ports pi-autoresearch into OpenCode as a pure skill that logs JSONL experiment runs and resumes autonomous keep/discard loops with built-in tools.
  • pi-autoresearch-studio - pi control plane: adds TUI and web dashboards, plan and ideas editing, and selective PR creation on top of pi-autoresearch sessions.
  • autoresearch-gen - Scaffold generator: interviews the user, generates a verified autoresearch experiment scaffold, auto-runs the baseline, and repairs broken generated code before handoff to the agent.
  • autoresearch-autoresearch - Meta-autoresearch repo: maintains a portable canonical loop distilled from karpathy/autoresearch and adjacent systems so new evidence can update a reusable agent-verifier architecture across domains.
  • Bilevel Autoresearch - Meta-autoresearch framework: adds outer loops that rewrite autoresearch search mechanisms themselves and reports multi-run gains on Karpathy's training benchmark.
  • SkyPilot parallel autoresearch - GPU infrastructure: gives Karpathy's autoresearch access to 16 GPUs so the agent can run parallel experiment waves, validate winners on faster hardware, and reach about 910 runs in about 8 hours.
  • autoresearch_deeplake_swarm - Cloud swarm infrastructure: extends Karpathy's loop with Modal-powered parallel workers and a shared Deeplake experiment notebook so multiple agents can explore train.py concurrently and surface only the best surviving commits.
  • Autoresearch on Red Hat OpenShift AI - Kubernetes ML infrastructure: runs Karpathy's autoresearch as a 24-hour OpenShift AI workload, packaging nanochat into containers that logged 198 experiments and improved validation loss by 2.3% without human intervention.
  • serverless-autoresearch - SageMaker infrastructure: parallelizes Karpathy's autoresearch on Spot training jobs so the agent evaluates train.py candidates with HUGI-style burst compute instead of paying for idle GPUs.
  • autoresearch-win-rtx - Windows GPU fork: ports Karpathy's single-file, 5-minute, val_bpb keep/discard loop to native Windows on consumer RTX GPUs.
  • autoresearch-amd - AMD GPU fork: ports Karpathy's single-file, 5-minute val_bpb keep/discard loop to ROCm by replacing Flash Attention 3 with portable SDPA for RDNA 4 cards.
  • autoloop - Agent runtime: generalizes Karpathy's autoresearch into bounded repo-level loops with inferred eval commands, explicit guardrails, and keep/discard decisions across multiple coding agents.
  • GOAL.md - Goal-spec framework: generalizes Karpathy's autoresearch to repos without native scalar metrics by constructing a project-specific fitness function in GOAL.md, then running measure → act → verify → keep/revert loops against it.
  • autoresearch-claude-code - Claude Code plugin: ports pi-autoresearch into a pure plugin skill with JSONL state, slash-command control, and autonomous keep/discard loops for arbitrary METRIC-based benchmarks.
  • autoresearch-benchmark - Benchmarking infrastructure: compares four autoresearch-style tools on the same sorting-throughput task and records both performance gains and iteration behavior under a shared setup.
  • CORAL - Multi-agent autoresearch infrastructure: runs Claude Code, Codex, or OpenCode workers in isolated worktrees, grades each attempt with coral eval, and keeps scored improvements while sharing notes and skills across agents.
  • autoresearch for agents - Agent evaluation template: adapts autoresearch to agent.py plus fixed run_eval.py and dataset.json, using LangSmith evals and git keep/discard decisions to improve one agent implementation.
  • autoresearch-automl - Benchmarking research: compares nine classical, LLM-based, and hybrid optimizers on Karpathy's nanochat task under a shared 24-hour budget, showing code-editing autoresearch is competitive but fixed-space classical HPO still wins.
  • autoresearch-anycloud - Cloud GPU infrastructure: wraps Karpathy's autoresearch in a unified Mac and cloud runner with platform setup, budget watchdogs, result collection, and automatic teardown across AWS, GCP, Azure, and OCI.
  • skill-autoresearch for Hermes Agent - Hermes skill: optimizes prompts, scripts, and validators through baseline → diagnose → patch → re-evaluate → keep/revert loops with dependency checks and conservative holdout rules.
  • autoresearch-anything - Claude Code skill: scaffolds Karpathy-style autoresearch pipelines for measurable business metrics by generating connectors, persistence setup, and deploy → measure → keep/discard loops around API-observable outcomes.
  • AutoSkill - Skill prompt optimization framework: applies Karpathy's keep/revert loop to SKILL.md, mutating one prompt at a time against test cases and improving an auto-reminder skill from 45% to 90% reliability over 60+ autonomous iterations.
  • EvoSkill - Skill-evolution framework: analyzes failed coding-agent trajectories, proposes skill or prompt changes, evaluates them against benchmarks, and keeps only better agent variants in a Karpathy-style self-improvement loop.
  • Skill Forge v2 - Skill and code optimization framework: adapts Karpathy's autoresearch to SKILL.md files and generic codebases, using dry-run validation, objective deltas, and keep/revert thresholds to steer autonomous or guided experiment loops.
  • autoimprove-cc - Claude Code skill optimizer: applies a Karpathy-style autoresearch loop to SKILL.md, scoring binary assertions from eval.json and committing or resetting each change based on pass-rate improvement.
  • ehmo/autoresearch-skill - Claude Code and Codex skill: generalizes autoresearch into clean-room red, green, and refactor teams that iteratively find issues, fix them under test, and simplify code on a feature branch while the coordinator keeps only verified progress.
  • ResearcherSkill - Claude Code and Codex skill: generalizes autoresearch into git-backed .lab/ sessions with branching experiment trees, convergence detection, and commit/revert control, improving Yggdrasil agent rules from 1.82 to 7.04 in a published loop.
  • Litmus - Parallel ML research infrastructure: turns OpenClaw into a multi-agent autoresearch lab with branch-isolated workers, scheduled director and synthesizer roles, and keep/revert experiment commits plus shared discoveries and skills.
  • Autoresearch CLI - Cross-agent experiment infrastructure: packages Karpathy's one-file, one-metric keep/revert loop as a Rust CLI that scaffolds configs, validates eval commands, records JSONL results, and installs slash-command skills into multiple coding agents.
  • codex-autoresearcher - Codex experiment infrastructure: runs optimization campaigns through separate worker and judge Codex processes, a static evaluate.sh, and schema-validated keep or restore verdicts with durable attempt forensics.
  • ExAutoresearch - Elixir autoresearch framework: hot-loads one experiment module at a time, trains GPT variants under fixed budgets across distributed GPU nodes, and uses a referee plus dashboard to early-stop losers and persist the best surviving trials.
  • slowresearch - Delayed-feedback experiment skill: adapts autoresearch to content, outreach, pricing, and other publish-and-wait workflows by logging human-reported metrics and proposing the next hypothesis across long feedback cycles.
  • AutoAgent - Agent-engineering infrastructure: applies Karpathy's autoresearch to a single-file Harbor agent harness, rewriting agent.py, benchmarking scored tasks, and keeping only prompt, tool, or orchestration changes that raise total score.
  • VibeHQ - Multi-agent coordination infrastructure: applies autoresearch to team protocol design by benchmarking agent swarms, analyzing failure logs, rewriting hub code via /optimize-protocol, and iterating until coordination flags and token waste fall.
  • helix - Agent-agnostic autoresearch infrastructure: generalizes Karpathy's loop into reproducible helix.yaml + program.md repos with backend-swappable agents, append-only experiments.tsv ledgers, and independently verifiable example helices.
  • Autolab Companion Tools - Autoresearch companion infrastructure: adds statistical keep/discard verdicts, experiment-history steering, and multi-agent branch competitions to Karpathy's GPT-pretraining loop through the autojudge, autosteer, and autoevolve CLIs.
  • autoresearch-cpu - CPU ML fork: ports Karpathy's autoresearch to commodity CPUs by replacing Flash Attention with native SDPA, shrinking defaults for 30-minute local runs, and preserving the same one-file val_bpb keep/discard loop without CUDA.
  • hugoferreira/autoresearch - Codebase research framework: generalizes Karpathy's loop into falsifiable hypotheses, isolated experiment worktrees, instrument-backed observations, strict gate review, and reusable lessons for measurable engineering goals.

Related Practices / Discussions

Source file: categories/related-practices-discussions.md

Trading / markets

Business / GTM workflows

Workflow automation / consumer ops

Prompt / evaluation

Software / code workflows

Scientific / research augmentation

Infra / benchmarking ideas

Knowledge Base / RAG Preparation

Source file: categories/knowledge-base-rag-preparation.md

  • autoresearch-genealogy - Genealogy: uses Claude Code /autoresearch prompts to expand family trees, verify claims against multiple sources, and keep a structured evidence-backed research vault.
  • AutoRAGsearch - RAG retrieval optimization: applies an autoresearch-style loop to a fixed QA benchmark by editing only rag_pipeline.py, running local retrieval experiments, and improving retrieval_score from 0.9472 to 0.9867 over 20 autonomous experiments.
  • claude-obsidian - Knowledge-base preparation: uses a Karpathy-style /autoresearch skill to gather sources, fill research gaps across three search rounds, and write structured source, concept, entity, and synthesis pages into a retrieval-ready Obsidian vault.

Market Research

Source file: categories/market-research.md

  • atlas-gic - Trading: applies Karpathy-style autoresearch to a swarm of market agents, rewriting the worst-performing prompts and keeping changes only when rolling Sharpe improves.

Workflow Automation

Source file: categories/workflow-automation.md

  • AutoResearchClaw - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
  • Claudini - AI safety research: uses an autoresearch-style loop to invent and benchmark new LLM attack algorithms, keeping only methods that outperform baselines.
  • AutoKernel - GPU optimization: applies Karpathy-style autoresearch to kernel bottlenecks, iterating on code, benchmarking, and keeping only changes that improve speed without breaking correctness.
  • autovoiceevals - Voice AI evaluation: attacks voice agents with adversarial callers, proposes prompt changes one at a time, and keeps or reverts them based on eval results.
  • PM document optimizer - Product workflow automation: applies a Karpathy-style git ratchet to markdown artifacts like PRDs and strategy docs, scoring each draft with programmatic checks and committing only higher-scoring revisions.
  • Trip Optimizer Pro - Travel planning workflow automation: applies the autoresearch pattern to itinerary generation by researching destinations, scoring multi-day plans, and keeping only itinerary mutations that improve a weighted travel-quality score.

Submission format

Use exactly one line per entry:

- [Name](URL) - Industry: one-sentence description of the autoresearch use case.
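
For example, a hypothetical entry (name, URL, and details invented purely for illustration) would read:

- [toy-autoresearch](https://github.com/example/toy-autoresearch) - Vision research: applies a keep/discard loop to CIFAR-10 training by editing one train.py and accepting only validation-accuracy gains.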

How to contribute

  1. Pick the category that best matches the direct autoresearch application domain.
  2. Add a single-line entry in the required format to the category file, not directly to the README aggregate.
  3. Keep the summary concrete and scannable.
  4. Prefer examples that clearly show scenario + autoresearch loop + value.

See CONTRIBUTING.md for details.

License

MIT
