AI Agent that handles engineering tasks end-to-end: integrates with developers’ tools, plans, executes, and iterates until it achieves a successful result.
Audit-grade multi-agent orchestration for CLI coding agents (Claude Code, Codex, Gemini CLI, +40 more). HMAC-chained audit log, signed agent cards, per-artefact lineage, air-gap deploy. The orchestrator your compliance team will sign off on. https://bernstein.run
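For readers unfamiliar with the chaining trick behind such an audit log: each entry's MAC covers the previous entry's MAC, so no record can be silently dropped, reordered, or edited. A minimal sketch; the function names are illustrative, not Bernstein's API.

```python
import hashlib
import hmac
import json

def append_entry(log, entry, key):
    """Append an audit entry whose MAC chains over the previous MAC."""
    prev_mac = log[-1]["mac"] if log else "0" * 64
    payload = json.dumps(entry, sort_keys=True)
    mac = hmac.new(key, (prev_mac + payload).encode(), hashlib.sha256).hexdigest()
    log.append({"entry": entry, "mac": mac})

def verify_chain(log, key):
    """Recompute every MAC from the start; any tampering breaks the chain."""
    prev_mac = "0" * 64
    for record in log:
        payload = json.dumps(record["entry"], sort_keys=True)
        expected = hmac.new(key, (prev_mac + payload).encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, record["mac"]):
            return False
        prev_mac = expected
    return True

key = b"secret-audit-key"
log = []
append_entry(log, {"agent": "claude-code", "action": "edit", "file": "main.rs"}, key)
append_entry(log, {"agent": "codex", "action": "test"}, key)
assert verify_chain(log, key)
```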
SE-Agent is a self-evolution framework for LLM code agents. It enables trajectory-level evolution that exchanges information across reasoning paths via Revision, Recombination, and Refinement, expanding the search space and escaping local optima. It achieves SOTA performance on SWE-bench Verified.
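A hedged sketch of what a trajectory-level evolution loop over those three operators could look like; the operators below are caller-supplied stand-ins, not SE-Agent's implementations.

```python
import random

def evolve(trajectories, revise, recombine, refine, score, generations=3):
    """Evolve a pool of trajectories with Revision / Recombination / Refinement."""
    pool = list(trajectories)
    for _ in range(generations):
        candidates = [revise(t) for t in pool]        # Revision: rework each path
        a, b = random.sample(pool, 2)
        candidates.append(recombine(a, b))            # Recombination: cross two paths
        candidates = [refine(t) for t in candidates]  # Refinement: polish candidates
        pool = sorted(pool + candidates, key=score, reverse=True)[: len(pool)]
    return max(pool, key=score)

# Toy run: trajectories are strings, score is length.
best = evolve(
    ["patch-A", "patch-B"],
    revise=lambda t: t + "+rev",
    recombine=lambda a, b: f"{a}|{b}",
    refine=lambda t: t.strip(),
    score=len,
)
print(best)
```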
Lightweight, auditable Python code agent (~1500 LOC) — ReAct + Planner + Reflexion + Hybrid RAG, with SWE-bench Lite eval and trace replay.
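A minimal, self-contained sketch of the ReAct portion of such a loop (Planner, Reflexion, and RAG layers omitted); the Step type and stub model are illustrative, not this repo's classes.

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str
    action: str
    arg: str

def react_loop(llm, tools, task, max_steps=10):
    """Minimal ReAct loop: the model alternates Thought/Action and the
    harness feeds back Observations. A sketch of the pattern only."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)                       # model proposes the next step
        transcript += f"Thought: {step.thought}\nAction: {step.action}({step.arg})\n"
        if step.action == "finish":
            return step.arg
        obs = tools[step.action](step.arg)           # harness executes the tool
        transcript += f"Observation: {obs}\n"
    return None

# Stub model: grep once, then finish.
def stub_llm(transcript):
    if "Observation" in transcript:
        return Step("found it", "finish", "bug is in parse()")
    return Step("search the repo", "grep", "TypeError")

print(react_loop(stub_llm, {"grep": lambda q: f"3 hits for {q!r}"}, "fix the TypeError"))
```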
An LLM council that reviews your coding agent's every move
SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: claude-code, aider (Opus 4.7 / GPT-5.5 / Sonnet 4.6 / Gemini 3.1 Pro).
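A rough sketch of the PR-mining step using the public GitHub REST API (the endpoint and response fields are GitHub's; the instance schema is an assumed SWE-bench-style shape, not this repo's exact format).

```python
import requests

def mine_merged_prs(owner, repo, token, limit=50):
    """Collect merged PRs as SWE-bench-style instances: the PR title/body
    becomes the problem statement, the PR diff becomes the gold patch."""
    headers = {"Authorization": f"Bearer {token}"}
    prs = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        params={"state": "closed", "per_page": limit},
        headers=headers,
    ).json()
    instances = []
    for pr in prs:
        if not pr.get("merged_at"):
            continue                                  # closed but never merged
        diff = requests.get(
            pr["url"], headers={**headers, "Accept": "application/vnd.github.diff"}
        ).text
        instances.append({
            "instance_id": f"{owner}__{repo}-{pr['number']}",
            "problem_statement": f"{pr['title']}\n\n{pr.get('body') or ''}",
            "base_commit": pr["base"]["sha"],
            "patch": diff,
        })
    return instances
```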
Squeeze verbose LLM agent tool output down to only the relevant lines
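The core idea can be sketched in a few lines: keep only lines matching error-ish patterns plus a little surrounding context, and collapse the rest. The patterns and defaults here are assumptions, not this tool's actual heuristics.

```python
import re

def squeeze(output, patterns=(r"error", r"fail", r"warning"), context=2, max_lines=40):
    """Keep only lines matching any pattern, plus `context` lines around
    each hit; collapse everything else into omission markers."""
    lines = output.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if any(re.search(p, line, re.IGNORECASE) for p in patterns):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    squeezed, last = [], None
    for i in sorted(keep):
        if last is not None and i > last + 1:
            squeezed.append(f"... ({i - last - 1} lines omitted)")
        squeezed.append(lines[i])
        last = i
    return "\n".join(squeezed[:max_lines])
```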
Save up to 40% on agent token costs with code graphs: call graphs, dependency graphs, dead code detection, and blast radius analysis.
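Of those analyses, blast radius is the least familiar: it is reverse reachability over the call graph, i.e. everything that can transitively call a changed function. A toy sketch, not this product's implementation.

```python
from collections import deque

def blast_radius(call_graph, changed):
    """Return every function that can transitively reach a changed one."""
    callers = {}                                   # invert caller -> callee edges
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)
    affected, queue = set(changed), deque(changed)
    while queue:
        fn = queue.popleft()
        for caller in callers.get(fn, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected

graph = {"api.handler": {"db.query"}, "db.query": {"db.connect"}, "cli.main": {"api.handler"}}
print(blast_radius(graph, {"db.connect"}))
# {'db.connect', 'db.query', 'api.handler', 'cli.main'}
```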
Lean orchestration platform for enterprise AI, where each decision costs hundreds of dollars. State machine core, HITL as a first-class state, and corrections that accumulate. First use case: a coding agent. Open research, early stage.
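A sketch of what "HITL as a first-class state" can mean in a state-machine core: human review is a node in the transition table rather than a side channel, and a rejection routes the correction back into planning. The states and events below are illustrative, not this platform's actual machine.

```python
from enum import Enum, auto

class State(Enum):
    PLAN = auto()
    EXECUTE = auto()
    HITL_REVIEW = auto()
    DONE = auto()
    FAILED = auto()

# Risky transitions route through human review before execution.
TRANSITIONS = {
    (State.PLAN, "low_risk"): State.EXECUTE,
    (State.PLAN, "high_risk"): State.HITL_REVIEW,
    (State.HITL_REVIEW, "approved"): State.EXECUTE,
    (State.HITL_REVIEW, "rejected"): State.PLAN,   # correction accumulates into the plan
    (State.EXECUTE, "ok"): State.DONE,
    (State.EXECUTE, "error"): State.FAILED,
}

def step(state, event):
    return TRANSITIONS[(state, event)]

assert step(State.PLAN, "high_risk") is State.HITL_REVIEW
```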
Open benchmark for AI coding agents on SWE-bench Verified. Compare resolution rates, cost, and unique wins.
Repository-level automated code repair agent using the SWE-bench dataset.
Do MCP tools serialize in Claude Code? Empirical study: readOnlyHint controls parallelism, IPC overhead is ~5ms/call. Reproduces #14353.
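For context, readOnlyHint is a standard tool annotation in the MCP spec: a tool advertising it in a tools/list response has roughly the shape below. Whether a client parallelizes read-only calls is client-specific, which is what this study measures for Claude Code.

```python
# Shape of a tool entry in an MCP `tools/list` response (per the MCP spec).
# `readOnlyHint: true` declares the tool does not mutate state, which the
# study finds Claude Code uses to decide whether calls may run in parallel.
tool_definition = {
    "name": "read_file",
    "description": "Read a file from the workspace",
    "inputSchema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
    "annotations": {"readOnlyHint": True},
}
```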
Build a private evaluation dataset to optimize your organization's token costs.
Benchmark suite for evaluating LLMs and SLMs on coding and SE tasks. Features HumanEval, MBPP, SWE-bench, and BigCodeBench with an interactive Streamlit UI. Supports cloud APIs (OpenAI, Anthropic, Google) and local models via Ollama. Tracks pass rates, latency, token usage, and costs.
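One metric such suites track is pass@k; the standard unbiased estimator from the HumanEval paper is small enough to quote.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), given n samples of which c passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3
```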
A technical guide and live-tracking repository for the world's top AI models, specialized by coding, reasoning, and multimodal performance.
Fast, Multi-Cloud Sandboxes for AI Agents.
Open harness for running, measuring, and visualizing agent benchmarks. Adapters for AutomationBench, τ-bench, LeRobot, WorkArena.
Recursive Formal Alignment Factory — Autonomous pipeline for generating ultra-high-density vericoding trajectories and training SOTA code models on a single T4 GPU. Multi-agent verification, formal proofs, dense rewards, recursive self-improvement.