Pairwise LLM judges (A/B/tie): budget-aware multi-turn packing, position-bias correction, pseudo-label distillation. Generalized from the 4th-place (gold) solution to Kaggle LMSYS Chatbot Arena.
Updated Jun 10, 2026 - Python
Academic Personal AI Infrastructure
Lumen — learner-owned AI education platform. Tell the AI what you want to learn: it builds you a private course in ~a minute, tutors you with course-scoped RAG + citations, and lets you share, clone & remix via a moderated catalog. BYOK, custom no-LangChain multi-agent orchestrator, golden evals in CI, MCP server. Live demo + public /eval.
LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog, Embabel, and any LLM client.
Evidence-first multi-agent debate skill: get a second opinion by pitting Codex × Claude Code (or GLM/DeepSeek/Qwen) to independently review, red-team & judge high-stakes code and architecture decisions.
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
AI evaluators CLI for your AI apps and Agents - Dynatrace AI Observability
Toward Expert-Level Medical Text Validation with Language Models
Production-grade Playwright + TypeScript QA framework with AI-powered testing, LLM-as-Judge evaluation, MCP server, 7 CLI agents, security fuzzing, CI/CD pipelines, Jira sync, and Slack reporting — zero-config, plug-and-play.
Open WebUI Logic Tree Multi-Model LLM Project - Advanced AI reasoning with Sakana AI's AB-MCTS and sophisticated multi-model collaboration
LabelHub: AI-assisted data labeling and review platform.
University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.
Agentic Extensible Test Automation Framework
Rubric-driven AI homework grading system built as a Claude Code Skill. Score student submissions with CoT reasoning, bias mitigation, and PDCA quality cycle.
The course teaches how to fine-tune LLMs using Group Relative Policy Optimization (GRPO)—a reinforcement learning method that improves model reasoning with minimal data. Learn RFT concepts, reward design, LLM-as-a-judge evaluation, and deploy jobs on the Predibase platform.
🚀 Production-grade, full-stack agentic ecosystem engineered with a DB-first approach using Turso & Drizzle, Generative UI built with Next.js & Server-Sent Events (SSE), and a robust LLM-as-Judge evaluation suite. Fully containerized and orchestrated via Kubernetes & Helm; leverages modern DevOps with GitHub Actions & GHCR.io scalable cloud-native deploy
A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors.
Demos for AI agent evaluation: LLM-as-judge, trajectory analysis, hallucination detection, cost benchmarks
Multi-agent research copilot that separates LLM generation from deterministic citation verification — HALLMARK F1-H 0.747 (full dev_public, N=1119). LangGraph pipeline with a failure-driven controller loop, evidence-grounded Critic, and local PDF RAG.
Synthesis-focused chemistry benchmark and evaluation package for agentic LLMs across reaction understanding, retrosynthesis, and route planning tasks.