ai-evals

Here are 45 public repositories matching this topic...

aisa-group / InferenceBench

Benchmarking Open-Ended Inference Optimization by AI Agents

benchmarks ai-safety vllm sglang claude-code codex-cli ai-evals ai-research-automation

Updated May 16, 2026
Python

solana8800 / langeval

Evaluation Infrastructure for AI Agents

ai-evaluation agent-evaluation ai-evals

Updated Feb 25, 2026
TypeScript

productfoundry101 / ai-evals-bootcamp

Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.

bootcamp red-teaming rag prompt-engineering llmops ai-product-management llm-evaluation claude-code ai-pm ai-evals

Updated Jun 8, 2026

yiouli / pixie-qa

Agent skill for AI agent development

skill dev eval llm agent-skills ai-evals

Updated Apr 22, 2026
HTML

mohsinsheikhani / property-maintenance-agent

Eval-first AI agent that triages property maintenance emails. The real work is the eval system around it: trace-driven error analysis, code graders and validated LLM-as-judge (TPR/TNR), component and end-to-end evals, a failure taxonomy, and a CI regression gate. LangGraph, FastAPI, Langfuse.

python evaluation openai ai-agents pydantic fastapi ai-engineering prompt-engineering llmops langfuse llm-evaluation langgraph llm-as-a-judge llm-observability agentic-ai context-engineering ai-evals

Updated Jun 7, 2026
Python

RafaelParonis / jailbench

🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.

python flask analytics openai alignment model-evaluation ai-safety security-testing red-teaming model-robustness anthropic litellm content-safety llm-jailbreaks tool-calling llm-benchmark ai-evals textual-tui

Updated Jun 11, 2026
Python

zaidazmi / AI-PM-PLAYBOOK

Playbook for PMs shipping AI products with PRDs, evals, HITL, launch gates, cost, and observability.

prd human-in-the-loop ai-agents evals ai-product-management llm-evals vibe-coding ai-pm ai-evals

Updated May 26, 2026
TypeScript

vibheksoni / jailbench

Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.

Updated Aug 12, 2025
Python

danielrosehill / Awesome-AI-Evaluations-Tools

Collection of frameworks and tools for AI evaluations, including tool-use, agentic AI, MCP, and multimodal

evaluations evals ai-evals

Updated Jun 8, 2026
Python

udapy / rusty-llm-jury

Rust-based CLI tool for estimating success rates when using LLM judges for evaluation.

rust-lang evaluation-framework ai-evals

Updated Jun 7, 2026
Rust

SuperfiedStudd / ai-evals-orchestration

End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.

gemini openai multi-model transcription human-in-the-loop model-comparison supabase anthropic llm-evaluation ai-evals evaluation-pipeline

Updated Mar 10, 2026
TypeScript

vitron-ai / aip-foundry-themis-starter

Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.

react typescript schema-validation themis contract-testing osdk developer-tooling agentic-workflows ai-evals foundry-workflows

Updated Mar 28, 2026
TypeScript

yenklabs / Dali

Dali is open evidentiary infrastructure for legal AI, focused on verifiable outputs, reproducible evaluations, and evidence artifacts.

open-source benchmark oss mcp provenance reproducibility legaltech evidence legal-ai ai-evaluation open-infrastructure ai-evals deterministic-ai legal-citations defensible-ai legal-infrastructure citation-integrity evidentiary-infrastructure

Updated Jun 4, 2026
Python

ishtiaqrahman / capitalbench

Offline, auditable benchmark for one-shot LLM market decisions.

finance benchmark reproducibility llm-evaluation ai-evals capitalbench

Updated Jun 10, 2026
Python

majdukovic / job-radar

AI-powered remote job aggregator with true-remote filtering, fuzzy skill match, and a working AI Evals Engineer portfolio (Hamel/Shreya methodology, calibrated judges, failure taxonomy).

typescript nextjs job-search posthog supabase ai-evaluation llm inngest prompt-engineering anthropic ai-evals

Updated May 14, 2026
HTML

alexcatdad / scenario-eval-harness

Scenario evaluation harness for AI workflows with structured reports and traces.

testing typescript reports llm ai-evals

Updated Jun 9, 2026
TypeScript

mborges-dev / extraction-evals

Reproducible benchmark for LLM-based structured extraction from documents. Compare Claude / GPT / Gemini / open-weight on the same task with cost + latency tracking.

python benchmark openai portuguese claude structured-extraction document-extraction pydantic anthropic llm-evals ai-evals

Updated Jun 9, 2026
Python

IsaacCavallaro / agent-evals-workbench

A lightweight workbench for dataset-driven agent and LLM evaluation.

python cli regression-testing llm-evals agent-evals openai-compatible ai-evals eval-harness

Updated May 1, 2026
Python

vishal-labade / llm_exp_platform_v2

Experimentation framework for LLM systems using simulated users, conversational behavioral metrics, and causal inference to evaluate prompt strategies, temperature, and model scaling.

experimentation causal-inference product-analytics llm-evaluation llm-benchmarking ai-evals

Updated Mar 8, 2026
Python

AlejandroFuentePinero / ai-jie

LLM extraction pipeline for job postings; ~80% skill classification accuracy via chain-of-thought scaffolding.

structured-output pydantic ayncio prompt-engineering ai-evals

Updated Apr 9, 2026
Python
