Benchmarking Open-Ended Inference Optimization by AI Agents
Updated May 16, 2026 · Python
Evaluation Infrastructure for AI Agents
Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.
Eval-first AI agent that triages property maintenance emails. The real work is the eval system around it: trace-driven error analysis, code graders and validated LLM-as-judge (TPR/TNR), component and end-to-end evals, a failure taxonomy, and a CI regression gate. LangGraph, FastAPI, Langfuse.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
Playbook for PMs shipping AI products with PRDs, evals, HITL, launch gates, cost, and observability.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Collection of frameworks and tools for AI evaluations, including tool-use, agentic AI, MCP, and multimodal
Rust-based CLI tool for estimating success rates when using LLM judges for evaluation.
End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.
Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.
Dali is open evidentiary infrastructure for legal AI, focused on verifiable outputs, reproducible evaluations, and evidence artifacts.
Offline, auditable benchmark for one-shot LLM market decisions.
AI-powered remote job aggregator with true-remote filtering, fuzzy skill match, and a working AI Evals Engineer portfolio (Hamel/Shreya methodology, calibrated judges, failure taxonomy).
Scenario evaluation harness for AI workflows with structured reports and traces.
Reproducible benchmark for LLM-based structured extraction from documents. Compare Claude / GPT / Gemini / open-weight on the same task with cost + latency tracking.
A lightweight workbench for dataset-driven agent and LLM evaluation.
Experimentation framework for LLM systems using simulated users, conversational behavioral metrics, and causal inference to evaluate prompt strategies, temperature, and model scaling.
LLM extraction pipeline for job postings; ~80% skill classification accuracy via chain-of-thought scaffolding.