DaoyuanLi2816 / pairjudge

Pairwise LLM judges (A/B/tie): budget-aware multi-turn packing, position-bias correction, pseudo-label distillation. Generalized from the 4th-place (gold) solution to Kaggle LMSYS Chatbot Arena.

nlp kaggle-competition lora preference-learning kaggle-solution gold-medal llm rlhf reward-model llm-as-judge chatbot-arena

Updated Jun 10, 2026
Python

jerry609 / PaperBot

Academic Personal AI Infrastructure

research paper nextjs multi-agent arxiv scholar rag fastapi paper2code daily-paper llm llm-as-judge

Updated May 8, 2026
Python

ahmedEid1 / lumen

Lumen — learner-owned AI education platform. Tell the AI what you want to learn: it builds you a private course in ~a minute, tutors you with course-scoped RAG + citations, and lets you share, clone & remix via a moderated catalog. BYOK, custom no-LangChain multi-agent orchestrator, golden evals in CI, MCP server. Live demo + public /eval.

python docker postgres typescript mcp nextjs e-learning celery observability rag fastapi llm byok pgvector evals llm-as-judge ai-tutor agentic-ai

Updated Jun 7, 2026
Python

dokimos-dev / dokimos

LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog, Embabel, and any LLM client.

Updated Jun 10, 2026
Java

zhjai / agent-arena

Evidence-first multi-agent debate skill: get a second opinion by pitting Codex × Claude Code (or GLM/DeepSeek/Qwen) to independently review, red-team & judge high-stakes code and architecture decisions.

opencode multi-agent code-review red-team codex ai-agents rag architecture-review openai-codex prompt-engineering llm-as-judge claude-code claude-code-skill agent-skill openclaw hermes-agent agent-arena evidence-checking deliberative-analysis

Updated Jun 9, 2026

minnesotanlp / cobbler

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

nlp evaluation bias bias-detection llm llms llm-evaluation llms-benchmarking llm-as-judge llm-as-a-judge llm-as-evaluator

Updated Feb 16, 2024
Jupyter Notebook

dynatrace-oss / dt-evals

AI evaluators CLI for your AI apps and Agents - Dynatrace AI Observability

ai evaluations agents observability evals llm-as-judge

Updated Jun 11, 2026
TypeScript

StanfordMIMI / MedVAL

Toward Expert-Level Medical Text Validation with Language Models

medical-text llm-as-judge

Updated Oct 23, 2025
Python

nshportun / BestTester

Production-grade Playwright + TypeScript QA framework with AI-powered testing, LLM-as-Judge evaluation, MCP server, 7 CLI agents, security fuzzing, CI/CD pipelines, Jira sync, and Slack reporting — zero-config, plug-and-play.

typescript mutation-testing mcp test-automation ci-cd allure-report api-testing owasp-zap security-testing page-object-model stryker e2e-testing github-actions qa-automation playwright jira-integration ai-testing aws-bedrock llm-as-judge

Updated Apr 21, 2026
TypeScript

johnsonfarmsus / openwebui-ab-mcts-pipeline

Open WebUI Logic Tree Multi-Model LLM Project - Advanced AI reasoning with Sakana AI's AB-MCTS and sophisticated multi-model collaboration

docker machine-learning ai pipeline multi-model monte-carlo-tree-search research-software sakana llm reasoning-engine open-webui llm-as-judge advanced-reasoning open-webui-tools ab-mcts

Updated Oct 10, 2025
Python

gcomfident-crypto / Online-Label

LabelHub: AI-assisted data labeling and review platform.

llm-as-judge ai-review llm-annotation

Updated Jun 11, 2026
TypeScript

edholofy / dojo.md

University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.

Updated May 2, 2026
TypeScript

WesleyPeng / agentic-taf

Agentic Extensible Test Automation Framework

docker bdd selenium atdd requests ui-automation automation-framework paramiko chaos-engineering multi-layer-architecture playwright llm-as-judge

Updated Jun 1, 2026
Python

ChantillyAn / homework-grader

Rubric-driven AI homework grading system built as a Claude Code Skill. Score student submissions with CoT reasoning, bias mitigation, and PDCA quality cycle.

education quality-control batch-processing claude excel-export rubric bias-mitigation anthropic llm-as-judge claude-code ai-grading claude-code-skill homework-grading

Updated Feb 22, 2026
Python

ksm26 / Reinforcement-Fine-Tuning-LLMs-with-GRPO

The course teaches how to fine-tune LLMs using Group Relative Policy Optimization (GRPO)—a reinforcement learning method that improves model reasoning with minimal data. Learn RFT concepts, reward design, LLM-as-a-judge evaluation, and deploy jobs on the Predibase platform.

reinforcement-learning machine-learning-algorithms language-model reward-design rft ai-training deeplearning-ai-courses ai-optimization multi-step-reasoning ai-evaluation rlhf llm-fine-tuning opensource-ai llm-as-judge predibase grpo llm-development token-level-control

Updated Jun 13, 2025
Jupyter Notebook

VinZCodz / llm-fullstack-ai-agentic-system

🚀 Production-grade, full-stack agentic ecosystem engineered with DB-first approach using Turso & Drizzle, Generative UI built with Next.js & Server-Sent Events (SSE), a robust LLM-as-Judge evaluation suite. Fully containerized and orchestrated via Kubernetes & Helm, leverages modern DevOps with GitHub Actions & GHCR.io scalable cloud-native deploy

kubernetes express tdd nextjs helm monorepo server-sent-events cloud-native next-js vercel vitest llm drizzle-orm generative-ui llm-as-judge agentic-ai turso-db langraph agentic-devops

Updated Mar 21, 2026
TypeScript

Ufonia / wer-is-unaware

A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors.

healthcare-ai dspy llm-as-judge gepa

Updated Mar 5, 2026
Python

elizabethfuentes12 / how-to-evaluate-ai-agents-sample-for-aws

Demos for AI agent evaluation: LLM-as-judge, trajectory analysis, hallucination detection, cost benchmarks

evaluation ai-agents cost-optimization opentelemetry hallucination-detection llm-as-judge agent-evaluation strands-agents

Updated May 21, 2026
Jupyter Notebook

2u39u4 / ResearchFlow

Multi-agent research copilot that separates LLM generation from deterministic citation verification — HALLMARK F1-H 0.747 (full dev_public, N=1119). LangGraph pipeline with a failure-driven controller loop, evidence-grounded Critic, and local PDF RAG.

python nlp multi-agent openai arxiv agents rag research-assistant semantic-scholar llm llm-evaluation hallucination-detection langgraph llm-as-judge citation-verification

Updated Jun 8, 2026
Python

SergeiNikolenko / SynthLadder

Synthesis-focused chemistry benchmark and evaluation package for agentic LLMs across reaction understanding, retrosynthesis, and route planning tasks.

benchmark chemistry cheminformatics mass-spectrometry retrosynthesis llm-as-judge agentic-ai pydantic-ai smolagents synthesis-planning openshell

Updated Jun 3, 2026
Python

llm-as-judge

Here are 177 public repositories matching this topic...