The Evaluator
Your go-to blog for insights on AI observability and evaluation.
Closing the Loop: Coding Agents, Telemetry, and the Path to Self-Improving Software
2025 marked the widespread adoption of coding agents — harnesses that autonomously write, test, and debug changes to software with minimal human intervention. Products like Claude Code, Codex, Cursor, and…
Inside Typeform’s AI Agent Stack
Typeform is building generative AI experiences to help customers create better forms faster and to make collecting insights feel more natural and useful end-to-end. In this Q&A, Marta Lorens, Senior…
CUGA Agent: From Benchmarks to Business Impact of IBM’s Generalist Agent
This paper reading features several of the researchers — including Segev Shlomov (PhD), Ido Levy, Asaf Adi, and Avi Yaeli — behind the widely acclaimed paper “From Benchmarks to Business…
Top Generative AI Conferences In 2026 for Engineers
GenAI stacks are shifting fast enough that staying current is an ongoing project, not a quarterly refresh. The hard part is separating durable engineering practices (evals, reliability, cost controls, security)…
New In Arize AX: January 2026 Updates
Arize AX shipped many new updates in January 2026. From an improved Evaluator Hub to custom prompt release labels, here are some highlights. Evaluator Hub: Reusable Evaluators We’re…
How Nebulock Democratizes Threat Hunting
Nebulock is on a mission to democratize threat hunting. Instead of relying only on deterministic rules or reacting to alerts as they come in, the team builds AI agents that…
Why AI Agents Break: A Field Analysis of Production Failures
As AI agents enter production environments, they face conditions their training does not cover. These systems generate fluent output, yet operational work demands exact actions. Small ambiguities compound fast when…
OWASP Top 10 for Agentic Applications: Compliance Guide
This guide maps the OWASP Agentic Security Initiative (ASI) top ten risks to specific Arize AX observability features and metrics you should implement to detect, monitor, and mitigate threats in…
Hierarchical Memory Management In Agent Harnesses
We’ve worked with thousands of customers building AI agents, and we’ve also spent the last two years building our own agent, Alyx, an in-product assistant for Arize AX. These experiences…
How Observability-Driven Sandboxing Secures AI Agents
AI agents become dangerous the moment they gain the ability to execute actions. Once an agent can touch the file system or invoke external tools, safety shifts from…