The Evaluator
Your go-to blog for insights on AI observability and evaluation.
What we learned testing 7 models under the same agent harness
Model swaps look like configuration changes, but they behave more like product migrations. A new model may be cheaper, faster, easier to get capacity for, or stronger on public benchmarks…
Building a self-improving agent on a context graph of human disagreement
You can build a measurably better agent from data you already have, without retraining a thing. The data is what your experienced humans do when they correct the AI. Capture…
Coding agent tracing and evaluation: An open source tool to improve AI coding workflows
Announcing coding harness tracing for observing, evaluating, and improving coding agent workflows across Claude Code, Cursor, Codex, GitHub Copilot, and Gemini CLI.
How we use Alyx to build Alyx: How to build an AI agent feedback loop
How Arize uses Alyx to debug Alyx: searching dense traces, aggregating failures, triaging dogfooding issues, and closing the AI engineering feedback loop.
Models got an order of magnitude better at following instructions in one year
A year ago, frontier models started losing track of instructions somewhere around 200–300 simultaneous constraints. With 2026 models, that ceiling is closer to 2,000, an order-of-magnitude jump. We re-ran IFScale to see how that happened and how each model fails.
From observability to context: What’s next for Arize Phoenix
As agents start changing software, they need a way to verify their work that includes traces, evals, feedback, and APIs. This is where Phoenix goes next — not the next release, but what this product becomes.
Agent harnesses have an expiration date
A benchmark-driven look at why agent harnesses need adaptive finish logic as model behavior changes across Claude, GPT-4o, and Gemma.
AI agent evaluation: How to test, debug, and improve agents in production
Lessons from building and shipping Alyx, our AI agent
Swarm management in agent harnesses: owning long-running agents
As we have built our own harness management tools internally at Arize, and watched external systems like Devin @cognition start managing other Devins, managed agents at @AnthropicAI and long-running…
What is an evaluation harness?
An evaluation harness is the standardized infrastructure that decides what gets evaluated, runs the evaluation, and acts on the result.
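To make the three responsibilities in that definition concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular harness is implemented: `EvalCase`, `run_harness`, `toy_system`, the `enabled` flag, the exact-match scoring, and the 0.8 threshold are all hypothetical names and choices made up for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One test case. All fields are hypothetical, chosen for illustration."""
    name: str
    prompt: str
    expected: str
    enabled: bool = True  # assumed flag used by the selection step


def run_harness(
    cases: list[EvalCase],
    system: Callable[[str], str],
    threshold: float = 0.8,  # assumed pass-rate gate
) -> dict:
    # 1. Decide what gets evaluated: here, only the enabled cases.
    selected = [case for case in cases if case.enabled]

    # 2. Run the evaluation: exact-match scoring, the simplest possible evaluator.
    results = {
        case.name: system(case.prompt).strip() == case.expected
        for case in selected
    }

    # 3. Act on the result: gate on the pass rate.
    pass_rate = sum(results.values()) / len(results) if results else 0.0
    decision = "promote" if pass_rate >= threshold else "block"
    return {"results": results, "pass_rate": pass_rate, "decision": decision}


def toy_system(prompt: str) -> str:
    """Stand-in for the model or agent under test; it just upper-cases input."""
    return prompt.upper()


cases = [
    EvalCase("upper-1", "abc", "ABC"),
    EvalCase("upper-2", "hi", "HI"),
    EvalCase("broken", "x", "y"),                     # fails on purpose
    EvalCase("skipped", "z", "nope", enabled=False),  # filtered out in step 1
]

report = run_harness(cases, toy_system)
print(report["decision"], round(report["pass_rate"], 2))
```

In a real setup the exact-match check would likely be replaced by task-specific evaluators, and the "act" step might open a ticket or gate a deploy rather than return a string. The point of the sketch is only the shape: selection, execution, and action live in one standardized loop.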