LangEval - Enterprise AI Agent Evaluation Platform

English | Tiếng Việt


LangEval is an enterprise-grade AI Agentic Evaluation Platform, pioneering the application of Active Testing and User Simulation strategies to ensure the quality, safety, and performance of Generative AI systems before they reach the market.

Tip: 🚀 Live POC Demo: explore the platform in action at langeval.space

Unlike passive monitoring tools that only "catch errors" after an incident has occurred, LangEval allows you to proactively "attack" (Red-Teaming), stress-test, and evaluate Agents in a safe Sandbox environment.


📑 Table of Contents

  1. Why Choose LangEval?
  2. Core Features
  3. Detailed Installation Guide
  4. Contributing
  5. Support the Project
  6. System Architecture
  7. Technology Stack
  8. Project Structure
  9. Development Roadmap
  10. Reference Documentation
  11. License

💡 Why Choose LangEval?

In the era of Agentic AI, traditional evaluation methods (based on text similarity) are no longer sufficient. LangEval addresses the toughest challenges in Enterprise AI:

  • Behavioral Evaluation: Does the Agent follow business processes (SOPs)? Does it call the correct Tools?
  • Safety & Security: Can the Agent be jailbroken? Does it leak PII?
  • Automation: How do you test 1,000 conversation scenarios without 1,000 testers?
  • Data Privacy: Runs entirely On-Premise/Private Cloud, never sending sensitive data externally.

🚀 Core Features

1. Active Testing & User Simulation 🧪

  • Persona-based Simulation: Automatically generates thousands of "virtual users" with different personalities (Difficult, Curious, Impatient...) using Microsoft AutoGen; a minimal sketch follows this list.
  • Multi-turn Conversation: Evaluates the ability to maintain context across multiple conversation turns, beyond simple Q&A.
  • Dynamic Scenarios: Flexible test scenarios supporting logical branching (Decision Tree).
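
For a concrete picture of how a persona can drive a multi-turn test, here is a minimal, hypothetical sketch using the pyautogen API. The persona prompt, model name, and turn limit are illustrative assumptions, not LangEval's actual worker code:

# Hypothetical persona-driven simulation sketch (pyautogen API).
# Persona text, model, and turn limit are illustrative only.
from autogen import AssistantAgent

llm_config = {"config_list": [{"model": "gpt-4o-mini", "api_key": "sk-..."}]}

# The agent under test: a simple support bot.
target_agent = AssistantAgent(
    name="support_bot",
    system_message="You are a polite airline support agent.",
    llm_config=llm_config,
)

# A simulated "Impatient" persona that drives the conversation.
simulated_user = AssistantAgent(
    name="impatient_user",
    system_message=(
        "You are an impatient customer. Demand a refund immediately "
        "and escalate whenever the answer is vague."
    ),
    llm_config=llm_config,
)

# Run a bounded multi-turn conversation (max_turns requires a recent pyautogen).
simulated_user.initiate_chat(target_agent, message="I want my refund NOW.", max_turns=5)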

2. DeepEval Integration & Agentic Metrics 🤖

  • Tiered Metrics System:
    • Tier 1 (Response): Answer Relevancy, Toxicity, Bias.
    • Tier 2 (RAG): Faithfulness (Anti-hallucination), Contextual Precision.
    • Tier 3 (Agentic): Tool Correctness, Plan Adherence (Process compliance).
  • Custom Metrics: Supports defining custom metrics using G-Eval (LLM-as-a-Judge); see the sketch below.
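
As an illustration, a custom G-Eval metric can be defined with DeepEval's public GEval API. The criteria, threshold, and test case below are invented for the example (G-Eval calls a judge LLM, so an API key such as OPENAI_API_KEY must be configured):

# Sketch of a custom LLM-as-a-Judge metric via DeepEval's GEval.
# Criteria, threshold, and test case are illustrative only.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

sop_adherence = GEval(
    name="SOP Adherence",
    criteria=(
        "Check whether the agent verified the customer's identity "
        "before discussing any account details."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="What's my account balance?",
    actual_output="Your balance is $42.",  # no identity check performed
)

sop_adherence.measure(test_case)  # invokes the judge LLM
print(sop_adherence.score, sop_adherence.reason)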

3. Orchestration with LangGraph 🕸️

  • State Machine Management: Manages complex states of the test process.
  • Self-Correction Loop: Automatically detects errors and retries with different strategies (Prompt Mutation) to find Agent weaknesses; a minimal state-machine sketch follows this list.
  • Human-in-the-loop: Breakpoint mechanisms for human intervention and scoring when the AI is uncertain.
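
To show how such a retry cycle can be modeled, here is a minimal, self-contained LangGraph sketch. The node logic is stubbed: the length check and "(rephrased)" suffix stand in for the real test runner and Prompt Mutation strategy:

# Minimal self-correction loop as a LangGraph state machine (stubbed logic).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class EvalState(TypedDict):
    prompt: str
    passed: bool
    attempts: int

def run_test(state: EvalState) -> EvalState:
    # Placeholder: send state["prompt"] to the agent under test and score it.
    passed = len(state["prompt"]) > 20  # stub check
    return {**state, "passed": passed, "attempts": state["attempts"] + 1}

def mutate_prompt(state: EvalState) -> EvalState:
    # Placeholder: rewrite the prompt to probe the same weakness differently.
    return {**state, "prompt": state["prompt"] + " (rephrased)"}

def should_retry(state: EvalState) -> str:
    return "done" if state["passed"] or state["attempts"] >= 3 else "retry"

graph = StateGraph(EvalState)
graph.add_node("run_test", run_test)
graph.add_node("mutate_prompt", mutate_prompt)
graph.set_entry_point("run_test")
graph.add_conditional_edges("run_test", should_retry, {"retry": "mutate_prompt", "done": END})
graph.add_edge("mutate_prompt", "run_test")  # the cycle LangGraph supports natively

app = graph.compile()
print(app.invoke({"prompt": "Hi", "passed": False, "attempts": 0}))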

4. Enterprise Security & Compliance 🛡️

  • Identity Management: Pre-integrated with Microsoft Entra ID (Azure AD B2C) for SSO.
  • RBAC Matrix: Detailed permission control down to every button (Admin, Workspace Owner, AI Engineer, QA, Stakeholder).
  • PII Masking: Automatically redacts sensitive information (emails, phone numbers, credit card numbers) at the SDK layer, before data leaves the client; a simplified sketch follows this list.
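
As a simplified illustration of SDK-side masking (not LangEval's actual implementation), a regex-based scrubber might look like this. Production-grade masking needs stricter patterns, e.g. Luhn validation for card numbers and locale-aware phone formats:

# Illustrative regex-based PII masking; patterns are deliberately simplified.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "PHONE": re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    # Order matters: mask card numbers before the looser phone pattern.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text

print(mask_pii("Reach me at jane@example.com or +1 415 555 0100."))
# -> Reach me at [EMAIL_REDACTED] or [PHONE_REDACTED].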

5. AI Studio & Comprehensive Dashboard 📊

  • Battle Arena: Compares A/B Testing between two Agent versions (Split View).
  • Root Cause Analysis (RCA): Failure Clustering to identify where the Agent frequently fails.
  • Trace Debugger: Integrated Langfuse UI to trace every reasoning step (Thought/Action/Observation); an instrumentation sketch follows this list.
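
For reference, recording a reasoning step so it appears in the Trace Debugger could look like the following sketch, assuming the Langfuse v2 Python SDK against a self-hosted instance. The keys, host, and names are placeholders:

# Hedged sketch: emitting a Thought/Action/Observation step to self-hosted
# Langfuse, assuming the v2 Python SDK. Keys and host are placeholders.
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3001",  # hypothetical self-hosted Langfuse address
)

trace = langfuse.trace(name="refund-scenario-042")

# One reasoning step of the agent under test, recorded as a span.
span = trace.span(name="tool-call:lookup_order", input={"order_id": "A123"})
span.end(output={"status": "refund_eligible"})

langfuse.flush()  # ensure events are sent before the process exits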

🚦 Detailed Installation Guide

Prerequisites

  • Docker & Docker Compose (v2.20+)
  • Node.js 18+ (LTS) & npm/yarn/pnpm
  • Python 3.11+ (Optional, for running individual services locally)
  • Git

Step 1: Clone Repository

git clone https://github.com/your-org/langeval.git
cd langeval

Step 2: Configure Environment Variables

Copy the .env.example file to .env in the root directory and update essential keys.

cp .env.example .env

# Edit .env file and update:
# 1. OPENAI_API_KEY=sk-... (Required for Simulation Agents)
# 2. GOOGLE_CLIENT_ID=... (Required for Auth)
# 3. GOOGLE_CLIENT_SECRET=...
# 4. NEXTAUTH_SECRET=... (Generate with: openssl rand -base64 32)

Step 3: Start Backend & Infrastructure (Docker Compose)

We use Docker Compose to spin up the entire backend stack, including Databases (Postgres, ClickHouse, Qdrant), Message Queue (Kafka, Redis), and Core Services (Orchestrator, Resource Service).

# Start all backend services in the background
docker-compose up -d

Note: This process may take a few minutes to download images and initialize the databases (PostgreSQL, Qdrant, ClickHouse). Check container status with docker-compose ps and ensure all containers are healthy before proceeding.

Step 4: Start Frontend (AI Studio)

Run the Next.js frontend application locally for the best development experience.

cd evaluation-ui

# Install dependencies
npm install

# Start the development server
npm run dev

Step 5: Access the Application

Once everything is running, open the AI Studio in your browser. With the default Next.js dev server it is available at http://localhost:3000.


๐Ÿค Contributing

We adopt the Vibe Coding (AI-Assisted Development) process. We welcome contributions from the community!

Please carefully read CONTRIBUTING.md to understand how to use AI tools to contribute effectively and according to project standards.


โค๏ธ Support the Project

If you find LangEval useful, please consider supporting its development to help us maintain server costs and coffee supplies! ☕

Donate with PayPal

paypal.me/end2end8x


๐Ÿ—๏ธ System Architecture

LangEval adopts an Event-Driven Microservices architecture, optimized for deployment on Kubernetes (EKS) and horizontal scalability.

graph TD
    user(("User (QA/Dev)"))

    subgraph "LangEval Platform (EKS Cluster)"
        ui("AI Studio (Next.js)")
        api("API Gateway")
        
        subgraph "Control Plane"
            orch("Orchestrator Service<br>(LangGraph)")
            resource("Resource Service<br>(FastAPI)")
            identity("Identity Service<br>(Entra ID)")
        end
        
        subgraph "Compute Plane (Auto-scaling)"
            sim("Simulation Worker<br>(AutoGen)")
            eval("Evaluation Worker<br>(DeepEval)")
            gen("Gen AI Service<br>(LangChain)")
        end
        
        subgraph "Data Plane"
            pg[(PostgreSQL - Metadata)]
            ch[(ClickHouse - Logs)]
            kafka[(Kafka - Event Bus)]
            redis[(Redis - Cache/Queue)]
            qdrant[(Qdrant - Vector DB)]
        end
    end

    user --> ui
    ui --> api
    api --> orch & resource & identity
    
    orch -- "Dispatch Jobs" --> kafka
    kafka -- "Consume Tasks" --> sim & eval
    
    sim & eval -- "Write Logs" --> ch
    orch -- "Persist State" --> redis & pg
    gen -- "RAG Search" --> qdrant
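To make the "Dispatch Jobs / Consume Tasks" flow in the diagram concrete, here is a hedged sketch of a worker consuming jobs with kafka-python. The topic name, broker address, and message schema are assumptions for illustration, not the platform's actual contract:

# Hedged sketch: a Compute Plane worker consuming Orchestrator-dispatched
# jobs via kafka-python. Topic, broker, and schema are illustrative.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "evaluation-jobs",                 # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="evaluation-workers",     # add consumers to scale horizontally
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    job = message.value
    # Placeholder: run DeepEval metrics for the scenario, then write
    # results to ClickHouse as the diagram above shows.
    print(f"Scoring scenario {job.get('scenario_id')} ...")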

๐Ÿ› ๏ธ Technology Stack

We select "Best-in-Class" technologies for each layer:

| Layer | Technology | Reason for Selection |
|---|---|---|
| Frontend | Next.js 14, Shadcn/UI, ReactFlow | High performance, good SEO, standard Enterprise interface. |
| Orchestration | LangGraph | Better support for Cyclic Graphs than traditional LangChain Chains. |
| Simulation | Microsoft AutoGen | The most powerful framework currently available for Multi-Agent Conversation. |
| Evaluation | DeepEval | Deep integration with PyTest, supporting Unit Testing for AI. |
| Observability | Langfuse (Self-hosted) | Open Source, data security, excellent Tracing interface. |
| Database | PostgreSQL, ClickHouse, Qdrant | Polyglot Persistence: the right DB for each job (Metadata, Logs, Vectors). |
| Queue/Stream | Kafka, Redis | Ensures High Throughput and Low Latency for millions of events. |

📂 Project Structure

The project is organized using a Monorepo model for easy management and synchronized development:

langeval/
├── backend/
│   ├── data-ingestion/      # Rust service: High-speed log processing from Kafka into ClickHouse
│   ├── evaluation-worker/   # Python service: DeepEval scoring worker
│   ├── gen-ai-service/      # Python service: Test data and Persona generation
│   ├── identity-service/    # Python service: Auth & RBAC
│   ├── orchestrator/        # Python service: Core logic, LangGraph State Machine
│   ├── resource-service/    # Python service: CRUD APIs (Agents, Scenarios...)
│   └── simulation-worker/   # Python service: AutoGen simulators
├── evaluation-ui/           # Frontend: Next.js Web Application
│   ├── docs/                # 📚 Detailed project documentation
│   └── ...
├── infrastructure/          # Terraform, Docker Compose, K8s manifests
└── ...

๐Ÿ—บ๏ธ Development Roadmap

The project is divided into 3 strategic phases:

Phase 1: The Core Engine (Q1/2026) ✅

  • Build Orchestrator Service with LangGraph.
  • Integrate Simulation Worker (AutoGen) and Evaluation Worker (DeepEval).
  • Complete Data Ingestion pipeline with Kafka & ClickHouse.

Phase 2: The Studio Experience (Q2/2026) 🚧

  • Launch AI Studio with Visual Scenario Builder (Drag & Drop).
  • Integrate Active Red-Teaming (Automated Attacks).
  • Human-in-the-loop Interface (Review Queue for scoring).

Phase 3: Scale & Ecosystem (Q3/2026+) 🔮

  • Battle Mode (Arena UI) for A/B Testing.
  • Integrate CI/CD Pipeline (GitHub Actions Quality Gate).
  • Self-Optimization (GEPA algorithm for Prompt self-correction).

📚 Reference Documentation

The comprehensive documentation system (Architecture, API, Database, Deployment) is located in the evaluation-ui/docs/ directory. This is the Single Source of Truth.


📄 License

This project is licensed under the MIT License. See the LICENSE file for more details.


LangEval Team - Empowering Enterprise AI with Confidence
