evals

Here are 338 public repositories matching this topic...

mastra-ai / mastra

From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.

nodejs javascript typescript ai reactjs mcp nextjs tts chatbots workflows agents llm evals

Updated May 16, 2026
TypeScript

Arize-ai / phoenix

AI Observability & Evaluation

openai datasets agents ai-monitoring ai-observability prompt-engineering llms langchain llmops anthropic llamaindex llm-eval evals llm-evaluation aiengineering smolagents

Updated May 16, 2026
Python

AgentOps-AI / agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.

agent ai openai evaluation-metrics mistral cost-estimation autogen groq agentops llm langchain anthropic evals ollama crewai agents-sdk openai-agents

Updated Mar 19, 2026
Python

Kiln-AI / Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

Updated May 15, 2026
Python

pydantic / logfire

AI observability platform for production LLM and agent systems.

python ai metrics logging trace openai observability pydantic fastapi opentelemetry ai-tools ai-observability evals llm-observability pydantic-ai agent-observability

Updated May 15, 2026
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated May 16, 2026
Python

lmnr-ai / lmnr

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

Updated May 16, 2026
TypeScript

harbor-framework / harbor

Harbor is a framework for running agent evaluations and creating and using RL environments.

rl-environments evals terminal-bench

Updated May 16, 2026
Python

MCPJam / inspector

Development platform to debug, chat, inspect, and evaluate MCP servers, MCP apps, and ChatGPT apps.

Updated May 16, 2026
TypeScript

mattpocock / evalite

Evaluate your LLM-powered apps with TypeScript

typescript ai evals

Updated Apr 28, 2026
TypeScript

GitHamza0206 / simba

Open-source, production-ready customer service with built-in evals and monitoring

knowledge-base customer-service rag llm evals

Updated Jan 12, 2026
TypeScript

superlinear-ai / raglite

🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL

markdown pdf postgres sqlite postgresql reranking rag vector-search duckdb colbert llm pgvector chainlit retrieval-augmented-generation evals late-interaction late-chunking query-adapter

Updated May 15, 2026
Python

future-agi / future-agi

Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Guardrails. Self-hostable. Apache 2.0.

ai simulation observability llm evals ai-gateway

Updated May 16, 2026
Python

ombharatiya / ai-system-design-guide

AI system design guide for engineers building production AI systems and evals.

aws machine-learning natural-language-processing azure gcp artificial-intelligence gemini llama interview-questions claude open-ai rag system-design-interview llm gen-ai evals agentic-workflow agentic-ai

Updated Apr 6, 2026

waynesutton / opensync

Cloud-synced dashboards for OpenCode and Claude Code. Track sessions, search with semantic lookup, export eval datasets.

open-source ai sessions convex opensync dasbhoard evals

Updated Feb 23, 2026
TypeScript

YutoTerashima / agent-safety-eval-lab

Agent trace and tool-use safety evaluation lab.

ai-agents red-teaming tool-use evals llm-safety

Updated May 2, 2026
Python

microsoft / promptpex

Test Generation for Prompts

testing evaluations prompt-engineering llms chatgpt evals gpt-4o

Updated May 14, 2026
TeX

keshik6 / HourVideo

[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

navigation perception summarization reasoning visual-reasoning egocentric-videos gpt-4 multiple-choice-questions benchmark-dataset video-language-understanding multimodal-large-language-models evals gemini-pro spatial-intelligence neurips-2024 1-hour-video-language-understanding long-form-video-language-understanding long-context-understanding

Updated Jul 12, 2025
Jupyter Notebook

METR / vivaria

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

ai elicitation ai-evaluation evals

Updated Feb 15, 2026
TypeScript

mclenhard / mcp-evals

A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. It helps ensure that your MCP server's tools work correctly and perform well.

ai mcp evals

Updated Jun 23, 2025
TypeScript
