needle-bench AI Coding Agent Leaderboard

v7.6.0 — native vendor CLI vs ostk kernel. 8 frontier models × 38 benchmarks, single-shot (samples=1), captured 2026-06-15. Cost/token Δ are resolved-gated (cells both arms solved) and per-bucket priced. Anthropic carries a full cache split; other harnesses report a usable total (basis noted per model) — see the per-cell matrix below.

Cost Δ / Token Δ = kernel vs native (negative = kernel cheaper/lighter). Click headers to sort, or jump to the per-cell matrix ↓

Native Kernel Native (vendor CLI) Kernel (ostk)
Model Type Solve Solve Cost Tokens Cost Tokens Cost Δ Tok Δ
Loading...

* B*: kernel solve sourced from the generic OpenRouter kernel (no hand-written native driver for this provider) — distinct from true B (kernel-cpu, native driver).

Per-cell results

Every model × benchmark. Each cell splits native │ kernel, colored by outcome. Benchmarks run hardest → easiest left to right. Tap any cell for the full token & cost breakdown.

solved failed timeout harness n/a
Loading matrix…

Results

Key findings (v7.6.0)

Methodology

→ Full verified results, per-cell tables, and caveats (v7.6.0)