llama-dash turns a self-hosted local inference box into an observable, policy-controlled AI gateway: one UI for model state, request history, API keys, routing rules, proxy metrics, and client setup. The currently implemented inference backend is llama-swap over llama.cpp.
It is the single public entrypoint for OpenAI-compatible and Anthropic-compatible clients. llama-dash owns proxy policy, logging, auth, routing, and backend normalization; your selected inference backend owns local model processes and inference when traffic is routed to local models.
```
 OpenAI SDK / Claude Code / Continue / Open WebUI
                      │
                      ▼
               llama-dash :3000
   dashboard · auth · logs · routing · metrics
          │                        │
          ▼                        ▼
  llama-swap :8080         direct /v1 upstreams
llama.cpp models · peers     OpenAI · Anthropic
```
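From a client's point of view, llama-dash is the only endpoint that exists. A minimal TypeScript sketch with the OpenAI SDK (the model name and the `LLAMA_DASH_API_KEY` variable are placeholders for your own values):

```ts
import OpenAI from "openai";

// llama-dash is the only endpoint this client ever sees; the key is a
// llama-dash API key, not an upstream provider key.
const client = new OpenAI({
  baseURL: "http://localhost:3000/v1",
  apiKey: process.env.LLAMA_DASH_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "my-local-model", // placeholder: a model name from your llama-swap config
  messages: [{ role: "user", content: "Hello through the gateway" }],
});

console.log(completion.choices[0].message.content);
```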
- Dashboard — live stats, sparklines, model timeline, upstream health, GPU monitoring, SSE-backed live refresh with ETag polling fallback, and update status.
- Model management — load/unload models, per-model stats, load history, config snippet.
- Request logging — every completed `/v1/*` call is queued for SQLite logging with searchable UI, histogram, and detail view.
- Transparent proxy — streaming SSE preserved, bounded body capture for logs, token counts scraped in-flight. Both OpenAI (`/v1/chat/completions`) and Anthropic (`/v1/messages`) shapes are supported — for example, point Claude Code at llama-dash via `ANTHROPIC_BASE_URL` to proxy and track your Claude Code usage as well (see the request sketch after this list).
- API keys — per-key rate limits (RPM/TPM), model allow-lists editable from the detail page, hashed at rest, per-key stats and model usage breakdown.
- Dashboard auth — Better Auth username/password and passkey session gate for the UI and `/api/*` with first-visit signup; `/v1/*` proxy auth stays API-key based.
- Policies — custom routing rules with real proxy enforcement for continue, model-rewrite, and policy-reject actions, plus explicit auth passthrough, direct HTTPS `/v1` upstream targets, encrypted upstream credential injection, per-key system prompt injection, and global request size limits.
- Attribution — configurable header mapping for client, end-user, and session metadata with setup examples for common clients.
- Request auditing — per-key usage tracking across all proxied calls.
- Prometheus metrics — `/metrics` exposes proxy request, token, latency-window, queue, upstream, running-model, and GPU gauges.
- GPU monitoring — NVIDIA, AMD, and Apple Silicon. VRAM, utilization, temperature, power.
- Config editor — edit llama-swap `config.yaml` in-browser with on-demand validation, enforced pre-save schema checks, and auto-reload.
- Inference backend facade — backend health, model list/running state, lifecycle actions, logs, and config are capability-driven so future runtimes can be added without weakening the llama-swap experience.
- Endpoints — copyable base URL, API key selector, code examples for curl, Python, TypeScript, Home Assistant, Claude Code, opencode, Open WebUI, and more.
- Playground — supports chat, image, speech, and transcription.
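As a concrete example of the Anthropic-shaped path through the proxy, a minimal `fetch` sketch against `/v1/messages` (the model name is a placeholder; whether it resolves to a local model or a direct upstream depends on your routing rules):

```ts
// Placeholder model name; llama-dash decides via routing rules whether this
// resolves locally or to a direct Anthropic upstream.
const res = await fetch("http://localhost:3000/v1/messages", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "x-api-key": process.env.LLAMA_DASH_API_KEY ?? "",
    "anthropic-version": "2023-06-01",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-5",
    max_tokens: 256,
    messages: [{ role: "user", content: "ping" }],
  }),
});
console.log(await res.json());
```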
- Give teammates API keys without exposing llama-swap directly on your network.
- See which models are running, which clients are using them, and where latency is coming from.
- Debug slow or failed requests with status, token usage, timing, routing, attribution, and upstream metadata in one place.
- Enforce model allow-lists, request size limits, model aliases, and routing rules before traffic reaches llama-swap/llama.cpp.
- Route Claude Code or other Anthropic clients through one observable gateway while preserving subscription/OAuth bearer flows.
- Keep Prometheus metrics and searchable SQLite request history for a single-box self-hosted AI stack.
Choose the compose file that matches your GPU vendor. Both setups use `./config/config.yaml` for llama-swap config, `./models/` for model files, and expose llama-dash on `http://localhost:3000`.
```sh
cp config/config.example.yaml config/config.yaml   # edit models
docker compose -f docker-compose.amd.yaml up -d
```

`docker-compose.amd.yaml` runs `ghcr.io/mostlygeek/llama-swap:rocm`, passes through `/dev/kfd` and `/dev/dri`, and also mounts `/dev/dri` into llama-dash so AMD GPU stats work in the dashboard.
```sh
cp config/config.example.yaml config/config.yaml   # edit models
docker compose -f docker-compose.nvidia.yaml up -d
```

`docker-compose.nvidia.yaml` runs `ghcr.io/mostlygeek/llama-swap:cuda` and requests `gpus: all` for the llama-swap service. This requires the NVIDIA Container Toolkit on the host.
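For the "edit models" step, a minimal sketch of one entry in `config/config.yaml`, assuming llama-swap's `models:`/`cmd` schema; the binary path, flags, and GGUF filename are placeholders, so check the llama-swap docs for the full schema:

```yaml
# Sketch of a single llama-swap model entry; paths and filenames are
# placeholders. ${PORT} is substituted by llama-swap at launch.
models:
  "my-local-model":
    cmd: >
      /app/llama-server --port ${PORT}
      -m /models/my-model-q4_k_m.gguf
    ttl: 300   # optional: unload after 300s idle
```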
- Node 24+
- pnpm
- A reachable llama-swap instance
```sh
cp .env.example .env   # edit INFERENCE_BASE_URL to point at your instance
pnpm install
pnpm db:migrate        # creates data/dash.db
pnpm dev               # http://localhost:5173
```

Copy `.env.example` to `.env` and fill in the values.
| Variable | Default | Notes |
|---|---|---|
| `INFERENCE_BACKEND` | `llama-swap` | Active inference backend. Only llama-swap is currently implemented. |
| `INFERENCE_BASE_URL` | `http://localhost:8080` | Inference backend base URL. No trailing slash. |
| `INFERENCE_INSECURE` | `false` | Skip TLS verification for inference backends with self-signed certs. |
| `INFERENCE_CONFIG_FILE` | (empty) | Absolute path to the backend config file. Required for the llama-swap config editor. |
| `DATABASE_PATH` | `data/dash.db` | SQLite file, relative to CWD. SQLite `:memory:` and `file:` URI paths are preserved for tests/special deployments. |
| `BETTER_AUTH_SECRET` | (none) | Secret for signing Better Auth session data; generate with `openssl rand -base64 33`. |
| `BETTER_AUTH_URL` | inferred | Optional external base URL for Better Auth redirects/cookies. Set this to the public HTTPS origin when using passkeys outside localhost. |
| `CREDENTIAL_ENCRYPTION_KEY` | (none) | 32+ character secret used to encrypt stored upstream provider credentials. Required before creating or injecting upstream credentials. |
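Putting the table together, a minimal `.env` sketch (the config path and both secrets are placeholders; generate your own):

```sh
# Example .env; secret values below are placeholders.
INFERENCE_BACKEND=llama-swap
INFERENCE_BASE_URL=http://localhost:8080
INFERENCE_CONFIG_FILE=/srv/llama/config.yaml
DATABASE_PATH=data/dash.db
BETTER_AUTH_SECRET=changeme-openssl-rand-base64-33
CREDENTIAL_ENCRYPTION_KEY=changeme-32-plus-character-secret
```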
See `docs/2026_05_03_inference_backends.md` for the backend abstraction, capability model, and future Ollama notes.
- `src/server/proxy/*` — the `/v1/*` pass-through: streaming SSE preserved, proxy context/body snapshots kept isolated, bounded request/response capture for logs, token counts scraped from responses as they fly by, and one queued SQLite row per completed request.
- `src/server/admin/*` — the `/api/*` admin surface consumed by the UI, with grouped route modules under `src/server/admin/routes/*` for models, requests, config, keys, aliases, routing, upstream credentials, settings, and system health. JSON GET responses support conditional ETag polling, `/api/events` streams lightweight dashboard events, and `/api/log-events` streams llama-swap logs only while the Logs page is mounted.
- `src/server/auth.ts` — Better Auth setup for dashboard username/password and passkey sessions; protects the UI and `/api/*`, not `/v1/*`. Signup is only allowed while no dashboard user exists.
- `src/server/gpu-poller.ts` — polls `nvidia-smi`/`rocm-smi`/`system_profiler` every 10s, caches the result in memory, and publishes GPU-change events for live dashboard refresh. AMD APUs use GTT (not VRAM) for actual usable memory; Apple shows unified memory and core count when available.
- `src/server/model-watcher.ts` — polls the inference backend running-model capability every 15s, diffs state, writes load/unload events to the `model_events` table, and publishes model-change events.
- `src/server/inference/*` — selected inference backend facade plus backend-specific adapters and hints.
- `src/server/llama-swap/client.ts` — typed client over llama-swap's HTTP API.
- `src/server/db/*` — Drizzle schema, SQLite initialization, and request/model-event indexes for common dashboard query paths. Apply migrations explicitly with `pnpm db:migrate`.
- `src/server/metrics.ts` — Prometheus text metrics for proxy requests, tokens, latency window gauges, queue depth/drops, upstream reachability, running models, and GPU gauges at `/metrics`.
- `Dockerfile`, `prod-server.mjs`, `docker-compose.amd.yaml`, `docker-compose.nvidia.yaml` — production container packaging for llama-dash by itself or bundled with llama-swap.
- `src/routes/*` — thin TanStack Start route entrypoints for `/`, `/login`, `/models`, `/models/:id`, `/requests`, `/logs`, `/system`, `/playground`, `/config`, `/settings`, `/keys`, `/keys/:id`, `/attribution`, `/policies`, `/endpoints`.
- `src/features/*` — feature-local page components and helpers grouped by route area (`dashboard`, `requests`, `keys`, `models`, `playground`, etc.).
- `src/lib/queries.ts` — TanStack Query hooks with SSE-driven cache invalidation for request, model, GPU, and system changes plus slow ETag polling fallback.
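To give a feel for the admin surface's live-update contract, a browser-context TypeScript sketch of the SSE-plus-ETag pattern described above (assumes an authenticated Better Auth session cookie; `/api/events` comes from the list above, everything else is illustrative):

```ts
// Subscribe to dashboard events over SSE; cookies carry the session.
const events = new EventSource("http://localhost:3000/api/events", {
  withCredentials: true,
});
events.onmessage = (e) => console.log("dashboard event:", e.data);

// Conditional-polling fallback: a 304 means nothing changed.
let etag: string | null = null;
async function pollJson(url: string): Promise<unknown | null> {
  const res = await fetch(url, {
    headers: etag ? { "if-none-match": etag } : {},
    credentials: "include",
  });
  if (res.status === 304) return null; // unchanged since last poll
  etag = res.headers.get("etag");
  return res.json();
}
```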
Route any Anthropic SDK (including claude-code) through llama-dash for logging, filtering, and per-request inspection. Supports Anthropic subscriptions. Traffic flows:
```
Claude Code ──► llama-dash :3000 (log + filter) ──► api.anthropic.com
```
Client config (`~/.claude/settings.json`):

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://<llama-dash-host>:3000"
  }
}
```

Leave `ANTHROPIC_AUTH_TOKEN` unset when using subscription OAuth — Claude Code manages the bearer itself and llama-dash passes it through unchanged.
In llama-dash, create an explicit routing rule under Policies that matches the `/v1/*` target path or Claude source model names, using the continue action, passthrough auth with the client `Authorization` header preserved, and the direct target `https://api.anthropic.com/v1`. All Anthropic requests are then transparently proxied through while llama-dash logs the traffic and applies your filters.
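Outside Claude Code, any Anthropic SDK client can be pointed the same way. A sketch with `@anthropic-ai/sdk` (the model name and localhost origin are placeholders; this assumes the Policies rule above forwards the client credential to the direct upstream):

```ts
import Anthropic from "@anthropic-ai/sdk";

// Point the SDK at llama-dash instead of api.anthropic.com; the gateway
// logs the request and forwards it per your routing rule.
const anthropic = new Anthropic({
  baseURL: "http://localhost:3000",
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const msg = await anthropic.messages.create({
  model: "claude-sonnet-4-5", // placeholder model name
  max_tokens: 128,
  messages: [{ role: "user", content: "hello through llama-dash" }],
});
console.log(msg.content);
```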
This project was developed with significant assistance from LLMs. Architecture decisions, implementation, and documentation were all shaped through human-AI collaboration.
MIT