Type a one-sentence learning goal — an AI orchestrator builds you a private course in ~50 seconds, a RAG tutor with citations teaches it, and you can audit every agent decision it made.
Custom multi-agent system, no LangChain · public evals with the weak scores kept in · live in production
Live demo · Eval results · Architecture · MCP server
Real production recording (Groq Llama 3.3 70B). Intake trimmed 6×, the ~50 s build 16× — the brief and the finished course are real-time.
Try it yourself: the one-click demo pre-fills demo@lumen.test / Demo!2026 and drops you into the tutor (free-tier box — give a cold page a few seconds).
A learner-owned, two-role e-learning platform — every signed-in user runs the whole loop themselves; admin only moderates and configures. The product is the loop; the point of the repo is the agentic system underneath it.
| Step | What happens |
|---|---|
| Define | A guided AI intake (capped at six turns) turns a fuzzy goal into a structured learning brief — the source goal is field-encrypted at rest |
| Build | The authoring orchestrator builds a private course from the brief — honest status, no half-finished partials, re-runnable, cancellable (build.py, the durability/idempotency/quota shell) |
| Learn | A course-scoped RAG tutor answers with lesson citations and a visible tool-call trace |
| Share | Publishing stays private; public listing is an explicit share + admin moderation state machine with an immutable audit trail |
| Clone | Any listed course can be remixed into your own draft, with server-written "Based on …" provenance and a sanitized export (no enrollments, traces, or soft-deleted content) |
| BYOK | Bring your own model key (OpenAI / Anthropic / Groq / Mistral) — allowlisted providers, server-owned base URLs, envelope-encrypted write-only keys |
Shipped to production as 2.0.0-two-role (CHANGELOG) — built as a gated waterfall: requirements → design → 6 ADRs → seven build streams, each cleared a Codex challenge, an independent Claude review, and a live in-browser walk before merge.
Every item below is on production today, with the code one click away.
The tutor picks per-turn among five sub-agents in tutor_subagents/ — retriever, web_searcher, code_runner, quiz_generator, concept_explainer — under a hard cap on tool-call rounds (streaming variant). The authoring side runs a six-stage pipeline — researcher → outliner → critic → reviser → lesson-drafter → final-critic — in authoring_orchestrator.py, capped at six revise/critic calls.
Retrieval is scoped per course and routed through a single ACL clause (visibility.py, ADR-0029) so private and cloned courses never leak chunks. Embeddings via Cloudflare Workers AI (bge-small-en-v1.5, 384-dim) into pgvector; answers cite specific lesson chunks.
Each LLM call logs prompt/completion tokens, USD cost, latency, and outcome to the llm_calls table (llm_call_log.py); each agent step lands in agent_tracer.py. Learners get a per-turn "show me how you got this" drill-down — planner steps, tool calls, retrieval audits with similarity scores; authors get a step-by-step build replay.
Three golden suites (30-item tutor, 10 authoring, 10 ingest) under evals/, judged 0–5 per axis, plus adversarial probes. A 3-item smoke gates every PR (workflow); results are public at /eval.
The point isn't the scores — it's the harness: LLM-as-judge applied honestly to one strong subsystem and two early ones, every number reproducible with make eval suite=… and smoke-gated in CI.
| Suite | Judged | LLM-judge score | Reading |
|---|---|---|---|
| Authoring | 10 / 10 | 3.85 / 5 | Strong — the headline number, backed by the raw JSONL |
| Tutor | 10 / 30 | 2.33 / 5 | Early — citation-format mismatch between the judge's expected citations and what the retriever pulls; 20 items skipped, cause documented |
| Ingest | 4 / 10 | 0.83 / 5 | Early — v1 chunker emits one module per video; 6 items failed upstream transcript fetch before judging |
The weak scores are published whole, on purpose. Methodology · raw reports.
app/mcp/ exposes nine tools (catalog, RAG tutor, FSRS reviews, AI authoring, ingest) over stdio + HTTP with OAuth client-credentials, published as io.github.ahmedEid1/lumen. Write tools gate on the can_author capability; URL ingest stays admin-only + flag-gated (ADR-0025).
Wiring it into an MCP client takes one JSON block:
Use it from Claude Desktop / Claude Code
// claude_desktop_config.json
{
"mcpServers": {
"lumen": {
"command": "python",
"args": ["-m", "app.mcp", "--transport", "stdio"],
"env": {
"LUMEN_MCP_AUTH_TOKEN": "<client-secret from `make mcp-token`>",
"DATABASE_URL": "postgresql+asyncpg://lumen:lumen@localhost:5432/lumen"
}
}
}
}Or for Claude Code: LUMEN_MCP_AUTH_TOKEN=<secret> claude mcp add lumen -- python -m app.mcp --transport stdio, then ask "list my Lumen courses" and watch list_courses, ask_tutor, create_course_draft, … fire. Full operator guide: docs/mcp.md.
Per-credential 256-bit DEKs wrapped by a versioned server KEK (secrets_crypto.py); decryption only inside the dispatch path — never in logs, traces, exports, or admin views. A prod boot guard refuses to start with stored credentials but no real KEK (ADR-0027). Request-count quotas close the $0-BYOK bypass of the dollar budget guard.
The two-role rebuild migrated student/instructor → user live: widen accepted roles → backfill → drop old values only after access tokens drained their 15-min TTL. The Alembic chain enforces one phase per run, and migration-level evidence gates refuse to tighten constraints over un-backfilled rows — one of them caught a real prod data condition (85 legacy chunks with NULL embedding model) at deploy time.
The short version: one FastAPI service owns all invariants, the agents are plain service-layer code, and every LLM call crosses one provider seam and one cost meter.
flowchart LR
user([User · authors + learns])
admin([Admin · moderates + config])
subgraph App[Application]
web[Next.js 15 · RSC]
api[FastAPI · Python 3.13<br/>capability-based authz<br/>central is_publicly_listed authorizer]
worker[Celery worker + beat]
end
subgraph Agents[Agent layer]
define[Goal intake → brief]
authoring[Authoring orchestrator<br/>researcher → … → final-critic]
tutor[Tutor orchestrator<br/>+ 5 sub-agents]
end
subgraph Data[Data]
pg[(Postgres 17 + pgvector)]
redis[(Redis 7)]
s3[(MinIO S3)]
end
subgraph LLM[Swappable LLM layer]
provider{LLM_PROVIDER dispatch}
groq[Groq · Llama 3.3 70B]
byok[BYOK allowlist<br/>OpenAI · Anthropic · Mistral · Groq]
end
subgraph Eval[Eval loop]
golden[(Golden datasets)]
judge[LLM-as-judge 0–5]
meter[llm_calls<br/>tokens · cost · latency]
end
user --> web --> api --> pg
admin --> web
api --> redis & s3 & worker
api --> define --> authoring
api --> tutor --> pg
mcp[MCP server · 9 tools] --> api
authoring & tutor --> provider
provider -.platform.-> groq
provider -.per-user.-> byok
api --> meter
golden --> judge --> provider
Provider-agnostic by env var — the live demo runs Groq's free tier; users dispatch on their own allowlisted keys. Every call crosses the cost meter, so budgets, quotas, and observability behave identically across providers. Full topology: docs/architecture.md.
Stack: Python 3.13 · FastAPI · async SQLAlchemy 2 · Alembic · Celery — Next.js 15 · React 19 · TypeScript 5 · Tailwind 4 · TanStack Query — PostgreSQL 17 (pgvector + tsvector) · Redis 7 · MinIO — Docker Compose · GitHub Actions · Trivy + CodeQL + gitleaks · Caddy 2.
The process is the portfolio as much as the code. Every build stream cleared three gates before merge:
- Codex challenge — a second-brain CLI attacks the design, plan, and code; findings triaged and resolved.
- Independent Claude review — a gating review subagent re-checks against source until clean.
- Live evidence — drive the app as a real user in a browser, locally and on prod, on top of unit/E2E/a11y suites. Running-the-app evidence is required, not optional.
At the 2.0.0 release: backend 1,421 tests / frontend 468 tests, all green; WCAG 2.2 AA axe-core gate (11 surfaces, 0 violations); en + ar i18n parity; visual-regression baselines; Playwright E2E on Chromium and WebKit. The UI itself went through a 20-loop redesign (30+ Radix-backed primitives, ⌘K command palette, dark/light themes) with five in-loop Codex rescue passes plus a final Codex review. A green main auto-deploys to production.
Prereqs: Docker Desktop 4.30+ (or Engine 27 + Compose v2).
git clone https://github.com/ahmedEid1/lumen.git
cd lumen
cp .env.example .env
make up && make migrate && make seedOpen http://localhost:3000 and sign in:
| Role | Password | |
|---|---|---|
| admin | admin@lumen.test |
Admin!2026 |
| user | teacher@lumen.test |
Teach!2026 |
| user | student@lumen.test |
Learn!2026 |
Without an LLM key the AI features fall back to a deterministic noop provider — the rest of the app still works. For the real thing (define/build, tutor, evals), a free Groq key is enough:
LLM_PROVIDER=openai
OPENAI_API_BASE=https://api.groq.com/openai/v1
OPENAI_API_KEY=<your-groq-key>
LLM_MODEL=llama-3.3-70b-versatileThe same LLMProvider abstraction takes native Anthropic or OpenAI by env var — no code changes. Feature flags (FEATURE_BYOK_ENABLED, FEATURE_PRIVATE_PUBLISH_ENABLED, CLONE_ENABLED, FEATURE_TUTOR_STREAMING) default off; set them in .env once their prerequisites (e.g. a real BYOK master key) are in place. make demo-seed adds the richer agentic-demo bundle.
More screenshots — dashboard, catalog, the agent-replay home page, the public eval page, a freshly built course, the brief review
- Live at lumen.ahmedhobeishy.tech — 2.0.0-two-role, shipped 2026-06-06. Free tier end-to-end: Groq + Cloudflare Workers AI + one AWS t4g.small (runbook: docs/deployment/aws-vps.md). Budget guards and request quotas cap spend; expect free-tier latency under load.
- Email verification is off in prod (
EMAIL_ENABLED=false— no SMTP configured). A flag, not a code limitation. - Eval scores include the weak ones, with causes documented — see the table above and docs/eval/ for raw JSONL.
- Test counts are release-time snapshots (the suites keep moving with
main); CI is the live source of truth.
Ahmed Hobeishy — AI / Agent Engineer in Essen, Germany. Lumen started as a 2020 Django side-project; five years and one model revolution later it's the centrepiece of my agentic-AI work: agents that are measured (golden evals, LLM-as-judge), auditable (per-call cost/latency traces, citation checks), and shipped (live, CI-gated, self-hostable).
Open to AI / Agent Engineer roles in Germany where evaluation and observability are first-class.