needle-bench AI Coding Agent Leaderboard
v7.6.0 — native vendor CLI vs ostk kernel. 8 frontier models × 38 benchmarks, single-shot (samples=1), captured 2026-06-15. Cost/token Δ are resolved-gated (cells both arms solved) and per-bucket priced. Anthropic carries a full cache split; other harnesses report a usable total (basis noted per model) — see the per-cell matrix below.
Cost Δ / Token Δ = kernel vs native (negative = kernel cheaper/lighter). Click headers to sort, or jump to the per-cell matrix ↓
| Native | Kernel | Native (vendor CLI) | Kernel (ostk) | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Model | Type | Solve | Solve | Cost | Tokens | Cost | Tokens | Cost Δ | Tok Δ |
| Loading... | |||||||||
* B*: kernel solve sourced from the generic OpenRouter kernel (no hand-written native driver for this provider) — distinct from true B (kernel-cpu, native driver).
Per-cell results
Every model × benchmark. Each cell splits native │ kernel, colored by outcome. Benchmarks run hardest → easiest left to right. Tap any cell for the full token & cost breakdown.
Results
Key findings (v7.6.0)
- The kernel's cache win is Anthropic-specific. With the cache-sliver active, opus ties native on solve at −2.5% cost / −31% tokens; sonnet is −42% cost / −66% tokens. Explicit
cache_controlis Anthropic-only — that's the lever. - Outside Anthropic, the kernel costs more — it routes through clients that forfeit the provider's automatic prompt caching the native CLI gets for free. Per both-solved cell: gemini +116%, gpt-5.5 +276%. Honest inverse of the opus story.
- Harness ceilings, not just models. devstral's native (Mistral Vibe) caps at 20 turns and solved 6/38; through the kernel it reached 32/38. Real, but it's the native harness's limit being removed, not raw capability.
- Solve parity holds for the strong models (opus, gpt, gemini, deepseek within 1–3 cells across arms); the kernel's value is provider- and harness-specific, not universal.
- A genuine difficulty ceiling. Only 1 of 37 benches (
postgres-migration-schema-drift, 12%) is a true wall and 9 discriminate; 27 are floor. The matrix above shows it — benches ordered hardest→easiest.
Methodology
- 38 Docker benchmarks with real bugs (
test.shexits 0 = fixed). 2 legacy pre-kernel tests retired. - Native: vendor CLI with its own key. Kernel (B): ostk with native CpuDriver. B*: ostk via generic OpenRouter (no native driver for that provider).
- Cost summed per token bucket (fresh 1×, cache-read 0.1×, cache-create 1.25×/2×) on a rate card validated to <0.5% of real Anthropic billing.
- Efficiency Δ computed only over cells both arms solved; split-resolve, both-fail, and zero-work infra cells excluded.
- Cost/token Δ shown only for Anthropic — gemini-cli / codex / opencode native arms don't report cost (solve-rate only).
- ostk under-reports its own cost ~13%; figures here use the corrected per-bucket recompute, not ostk's self-estimate.
→ Full verified results, per-cell tables, and caveats (v7.6.0)