The open-source MultiAgentOps evaluation and verification harness for any industry business workflow.
Updated Jun 8, 2026 - Python
Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
VLA ≠ VLM. Side-by-side viewer running NVIDIA Alpamayo R1 (vision-language-action) alongside Qwen2.5-VL (vision-language) on the same 44-sec SF dashcam clip at 5 Hz. 220 paired traces. Surfaces what an action-trained model sees that a scene-trained model doesn't, and vice versa.
AI content engine using an anxiety-indexed behavioral science KB, a multi-stage LangGraph pipeline, and a calibrated LLM-as-judge evaluation harness.
Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild
An LLM-powered training-evaluation platform that scores open-ended scenario responses 0 to 10 against rubrics, with an evaluation harness that benchmarks the AI scorer against human-labelled scores.
Enterprise RAG lab using AWS Bedrock, Snowflake, MuleSoft, Python, and an evaluation harness for regulated lending scenarios.
Evaluation harness for domain-specific RAG and QA systems with benchmark datasets, scoring, and regression workflows.
DoE Project
frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without infrastructure.
Authority-aware RAG evaluation for industrial manual questions
Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority—not recall—changes retrieval outcomes.
Prompt-evaluation toolkit: run golden-case prompts, route models, track cost, and produce a leaderboard.
A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.
AI-agent evaluation harness with rubrics, regression datasets, deterministic checks, and reports.
Production-shaped DV agent evaluation harness with simulator adapter boundary, trajectory scoring, reward decomposition, and JSONL trace persistence.
Runnable benchmark toolkit for monophonic ABC melody generation and editing.
Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regression gates for CI, Phoenix/Langfuse exporters. Built for intent classifiers but works on any classification task.