Automated evaluation platform for AI-powered applications. Generate diverse test inputs and score the responses with LLM judges whose reliability is validated against human annotations.
For both setup and development, direct your agent to AGENTS.md.
```bash
git clone https://github.com/govtech-responsibleai/kaleidoscope.git
cd kaleidoscope
cp .env.example .env   # add your LLM API key — see [Providers](#providers)
docker compose up -d   # log in: dev / dev
```

Head to http://localhost:3000 to view your app.
A default admin user (dev / dev) is created on first startup. Add more users via the admin panel once logged in.
Connect any LLM application — point Kaleidoscope at any HTTP endpoint. Your chatbot, RAG pipeline, or custom API becomes the evaluation target with no code changes required.
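For illustration, the target can be as small as a single chat route. The sketch below is hypothetical: the /chat path and the "message"/"response" field names are assumptions, since Kaleidoscope is pointed at whatever contract your application already exposes.

```python
# Hypothetical evaluation target: a minimal FastAPI chat endpoint.
# The /chat path and the "message"/"response" field names are illustrative
# assumptions; Kaleidoscope targets whatever HTTP contract you already have.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

class ChatResponse(BaseModel):
    response: str

@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest) -> ChatResponse:
    # Your existing chatbot / RAG pipeline answers here, unchanged.
    return ChatResponse(response=f"Echo: {req.message}")
```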
Define custom rubrics — write scoring criteria tailored to your use case. Evaluate dimensions like accuracy, tone, safety, or any domain-specific quality you care about.
Generate diverse evaluation questions — create user personas with Singapore contextualisation and generate realistic questions across types (typical/edge) and scopes (in-KB/out-of-KB).
Annotate with judge assistance — claims and full responses are highlighted with judge reasoning. Human annotation in one click.
Measure judge reliability — evaluate answers with multiple LLM judges for comparison. Judge reliability is calculated from human annotations. Only reliable judges contribute to aggregated scores.
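A minimal sketch of the idea, assuming reliability is framed as a simple agreement rate with human annotations; Kaleidoscope's actual metric may differ:

```python
# Illustrative only: frames judge reliability as plain agreement with
# human annotations; the platform's real metric may be more sophisticated.
human_labels = ["pass", "fail", "pass", "pass", "fail"]
judge_verdicts = {
    "judge_a": ["pass", "fail", "pass", "fail", "fail"],  # agrees on 4/5
    "judge_b": ["fail", "pass", "fail", "pass", "pass"],  # agrees on 1/5
}

RELIABILITY_THRESHOLD = 0.7  # hypothetical cut-off

def agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of items where the judge's verdict matches the human label."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)

reliable_judges = {
    name: score
    for name, verdicts in judge_verdicts.items()
    if (score := agreement(verdicts, human_labels)) >= RELIABILITY_THRESHOLD
}
print(reliable_judges)  # {'judge_a': 0.8}: only these feed aggregated scores
```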
Kaleidoscope uses LiteLLM — any provider LiteLLM supports works out of the box. Add the relevant key to .env and you're set:
| Provider | Env var |
|---|---|
| Gemini | GEMINI_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| Azure OpenAI | AZURE_API_KEY + AZURE_API_BASE |
| AWS Bedrock | AWS_BEARER_TOKEN_BEDROCK |
| OpenRouter | OPENROUTER_API_KEY |
| Fireworks | FIREWORKS_AI_API_KEY |
Default models and the full list live in backend/src/common/llm/provider_catalog.yaml — add your own there.
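Independently of Kaleidoscope, you can smoke-test a key by calling LiteLLM directly; the model name below is just an example, so substitute one your provider offers:

```python
# Smoke test: LiteLLM picks up GEMINI_API_KEY / OPENAI_API_KEY etc. from the
# environment. "gemini/gemini-1.5-flash" is an example model name only.
from litellm import completion

resp = completion(
    model="gemini/gemini-1.5-flash",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```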
Stack: FastAPI + SQLAlchemy + LiteLLM (Python 3.13, uv) · Next.js 16 + React 19 + MUI v7 (TypeScript) · PostgreSQL
Non-dev / full stack:

```bash
git clone https://github.com/govtech-responsibleai/kaleidoscope.git
cd kaleidoscope
docker compose up -d
```

Dev (recommended):

```bash
docker compose up -d db backend   # db + backend in Docker
cd frontend && npm run dev        # frontend locally with hot reload
```

| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:8000 |
| API docs | http://localhost:8000/docs |
Docker reference: DOCKER.md
Subsystem docs: Backend | Frontend
```bash
cd backend && uv run pytest tests/   # backend (unit + integration)
cd frontend && npm run test:ui       # frontend (Playwright, no backend needed)
```

Configure your images in docker-compose.yml and the backend/Dockerfile / frontend/Dockerfile.
Before deploying to production, rotate the dev secrets to strong random values:

```bash
# Run twice — once for JWT_SECRET_KEY, once for ADMIN_API_KEY
cd backend && uv run python scripts/generate_secret.py
```

Set the outputs in .env or your deployment environment.
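If you prefer not to run the bundled script, a roughly equivalent one-liner using Python's standard library (the script's exact output format may differ):

```python
# Prints a URL-safe random secret; run once per value you need
# (JWT_SECRET_KEY, ADMIN_API_KEY). scripts/generate_secret.py may format
# its output differently, so treat this as an equivalent, not a replica.
import secrets

print(secrets.token_urlsafe(48))
```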
Nemotron dataset: The first call to sample personas downloads the configured NVIDIA Nemotron dataset and caches it to ~/.cache/huggingface/. Expect some download time and disk usage on first run — subsequent calls are instant. See Customising personas to change the dataset.
Three options to customise personas for test case generation:
- AI personas
- Upload personas
- Sample from NVIDIA's Nemotron Personas.
By default, the platform uses Nemotron-Personas-Singapore (~148K rows). To use a different country's dataset, set NEMOTRON_PERSONAS_DATASET in .env:
```env
# .env
NEMOTRON_PERSONAS_DATASET=nvidia/Nemotron-Personas-USA
```

The value must be a valid nvidia/Nemotron-Personas-* HuggingFace path. For adding style templates for other countries, see Backend README.
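To pre-warm the Hugging Face cache (or sanity-check a dataset path) before the first in-app sampling call, a minimal sketch with the datasets library; the split name is an assumption, so check the dataset card:

```python
# Downloads the configured persona dataset into ~/.cache/huggingface/
# so the first sampling call inside the app doesn't pay the download cost.
from datasets import load_dataset

ds = load_dataset("nvidia/Nemotron-Personas-Singapore", split="train")  # split is an assumption
print(f"{len(ds):,} personas cached")
```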
For Whole-of-Government (WOG) deployments there are three optional add-ons — all independent, pick what you need:
1. WOG providers — enable AIBots and other WOG-internal connectors:
   ```env
   # .env
   KALEIDOSCOPE_EXTENSIONS=aibots
   ```

   You then select "aibots" during Target Application set-up. Full connector reference: backend/src/extensions/aibots/README.md
2. Singapore personas — keep NEMOTRON_PERSONAS_DATASET at its default (nvidia/Nemotron-Personas-Singapore). Recommended if you need general-purpose personas for a Singapore-context evaluation.
3. Singapore-contextualised generation prompts — the built-in LLM prompt templates (backend/src/common/prompts/templates/) are written for Singapore government and public-facing digital services (references to CPF, HDB, NS, .gov.sg, etc.). They work out of the box for WOG use cases.
   To adapt for a different domain, edit the Markdown files in that directory. You can also customise the evaluation/judge prompts (accuracy_judge.md, checkworthy.md, *_rubric_judge.md) to fit your use case.
Reach out to the AI Practice team for setup details.
Happy evaluating!
MIT